If you are wondering what is so interesting about generative adversarial networks (GANs), please refer to the following link. In this article we dive further into the depths of GANs and understand how this technique works.
There have been a lot of improvements in generative adversarial networks (GANs) over time, but let’s go back to the origin of it all in order to understand the concept.
Have you heard about the art forger Mark Landis?
It is quite an interesting story. He had been responsible for submitting forgeries to several art museums and got even better at making them over time. He often donated his counterfeits to these museums with doctored documents and even dressed as a priest to avoid suspicion. Leininger (curatorial department) was the first person to pursue Landis. You can read more about it here. But for the purpose of explaining this concept we need limited knowledge of this event.
Imagine that you are the person in charge (Leininger) of identifying whether a presented painting is fake or authentic, and that Landis is making his first forgery.
At first, you find it easy to identify a fake. However, over time both Landis and you get better. Landis develops more sophisticated skills, making it increasingly difficult for you to spot fakes.
To connect with the example in the previous section, consider the generator as Landis and the discriminator as Leininger. Here, however, the generator and the discriminator are two separate neural networks, each trying to reduce its own error.
The generator tries to produce output that fools the discriminator, while the discriminator tries to tell actual data from fake data. In other words, generative adversarial networks (GANs) are inspired by a zero-sum non-cooperative game in which the generator tries to maximize the number of times it fools the discriminator while the discriminator tries to minimize that number.
These networks use back-propagation to reduce the error.
In other words, the generator and discriminator are two adversaries or opponents playing a game. They go back and forth against each other, improving their skill over time.
The loss function consists of two parts:
> Generator's Loss + Discriminator's Loss > Loss while identifying real data points + Loss from generated / fake data.
$ \min_G \max_D V(D, G)= \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] $
Let’s start with the Generator’s loss.
Let D be the discriminator and G the generator.
To start with, consider G(z), which is the output of the generator neural network for the noise input z. Note that we have randomly sampled the noise from a probability distribution. The generator based on the weights it has learned is able to transform the noise into hopefully something more meaningful.
D( G(z) ) is the discriminator using the output of the generator as an input.
In other words, at this point we are asking for the probability, as judged by the discriminator, that a fake instance (forgery) is real.
$ \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] $
The above mentioned part can be summarized in the following points:
The second part of the loss function is rather simple.
$ \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] $
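To make the two expectations concrete, here is a minimal sketch that estimates V(D, G) from a handful of discriminator outputs. The function name `gan_value` and all the numbers are assumptions made up for illustration; they do not come from a trained model.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) from discriminator outputs.

    d_real holds D(x) for real samples, d_fake holds D(G(z)) for generated
    samples. Both are probabilities strictly between 0 and 1.
    """
    real_term = np.mean(np.log(d_real))         # E_x[log D(x)]
    fake_term = np.mean(np.log(1.0 - d_fake))   # E_z[log(1 - D(G(z)))]
    return real_term + fake_term

# Hypothetical discriminator outputs on three real paintings.
d_real = np.array([0.9, 0.8, 0.95])

# A weak forger: the discriminator is confident the samples are fake.
v_weak_generator = gan_value(d_real, np.array([0.1, 0.2, 0.05]))
# A strong forger: the discriminator is often fooled.
v_strong_generator = gan_value(d_real, np.array([0.7, 0.8, 0.6]))

print(v_weak_generator, v_strong_generator)
```

The value drops as the generator gets better at fooling the discriminator, which is the direction the generator pushes in, while the discriminator pushes the value up.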
In this article we have not explored certain concepts in too much detail such as:
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
Ankit Gadi: A knack and passion for data science, coupled with a strong foundation in Operations Research and Statistics, has helped me embark on my data science journey.
The buzz around deepfakes has reached far and wide, and the topic has been a subject of conversation for several months. Let’s understand what the buzz is all about.
Let’s look at the greatest hits and the most impressive applications of GANs (Generative Adversarial Networks) before we take a deep dive into the algorithm.
This type of image synthesis is a form of conditional GAN, which is known for several applications.
The list above is just a preview of some of the applications of generative adversarial networks (GANs). We at data science discovery also felt inspired and started on our journey to discover this concept and dive into the depths of this topic.
I looked at several whitepapers to get familiar with this topic. Further, one of my fellows, Navin Manaswi, who at the time was working on a new book “Generative Adversarial Networks with Industrial Use Cases” helped out by sharing some of the chapters he had written.
My colleague and I also decided to experiment with this technique on one of our ongoing projects. In most examples that we have seen, it is evident that it works well on images, but what about structured data? That, however, is a whole other story.
In the next article we dive deeper into the concept by building an intuition and learning about the architecture of these neural networks.
Our focus is to solve a text classification problem using deep learning. To reiterate, the NLP (Natural Language Processing) text classification problem is as follows:
In today’s world, websites have to deal with toxic and divisive content. This is especially true of major websites like Quora, which cater to large traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.
A question is classified as insincere if:
For more information regarding the challenge you can use the following link.
The full deep learning code used is available here.
In part 1, we saw how machine learning algorithms could be applied to text classification. We had to identify and create a variety of features to reduce the complexity in the data. This required considerable effort but was essential for the learning algorithms to be able to detect patterns in the data.
In this section, we will approach the same problem using deep learning techniques. Deep learning has become the state-of-the-art method for a variety of classification tasks. First, though, we need to understand the motivation for using it, which is outlined here:
That all sounds great, but what is the difficult part of this task? The major problem is defining the right architecture. To get started, our primary task is to understand the type of neural network we wish to use and then its architecture. In other words, we need to make a choice based on the following parameters:
Let’s understand some key concepts before we proceed further:
Words represented as real valued vectors are what we call word embeddings. The value associated with each word is either learned by using a neural network on a large dataset with a predefined task (like document classification) or by using an unsupervised process such as using document statistics. These weights can be used as a part of transfer learning to take advantage of the reduced training time and better performance.
Consider how we make sense of a sentence: we look not only at a word but also at how it fits with the preceding words. Recurrent neural networks (RNNs) take into consideration both the current input and the preceding inputs. They are suitable for sequence problems because their internal memory stores important details about the inputs they received in the past, which helps them predict the output of the next time step.
GRU and LSTM are improved variations of the vanilla RNN, which tackle the problem of vanishing gradients and handling of long term dependencies.
Consider a simple use case in which we are trying to infer the weather from a conversation between two people. Let’s take the following text as an example:
“We were walking on the road when it started to pour, luckily my friend was carrying an umbrella.”
The target variable here has a classification of “rain”. In such a case the RNN has to keep “pour” in memory while considering “umbrella” to correctly predict rain, as the occurrence of “umbrella” alone is not definitive proof of rain. Similarly, the occurrence of “didn’t” before “pour” would have changed everything.
A bidirectional RNN first forward propagates left to right, starting with the initial time step. Then starting at the final time step, it moves right to left until it reaches the initial time step. This learning of representations by incorporating future time steps helps understand context better.
In other words, let’s go back to our example of deciphering the weather from conversations. Reading left to right, it makes sense to make the leap from “pour” to “umbrella”. Reading right to left as well adds to the power of the model, because another conversation may have a different pattern of word occurrence, for example:
“I took out an umbrella as it started to pour.”
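The two reading directions can be sketched with a vanilla tanh RNN in NumPy. This is an illustration under assumed sizes and random weights (the helper `rnn_pass` is hypothetical), not the GRU or LSTM layers used later in the article.

```python
import numpy as np

def rnn_pass(inputs, w_in, w_rec):
    """Run a vanilla tanh RNN over `inputs`; return the hidden state at every step."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state (the "memory").
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
n_steps, n_features, n_units = 6, 4, 3              # hypothetical sizes
sentence = rng.normal(size=(n_steps, n_features))   # stand-in for word embeddings

# Each direction has its own weights, as in a bidirectional layer.
w_in_fwd = rng.normal(size=(n_units, n_features))
w_rec_fwd = rng.normal(size=(n_units, n_units))
w_in_bwd = rng.normal(size=(n_units, n_features))
w_rec_bwd = rng.normal(size=(n_units, n_units))

forward_states = rnn_pass(sentence, w_in_fwd, w_rec_fwd)
# Read right to left, then flip back so step t lines up with the forward pass.
backward_states = rnn_pass(sentence[::-1], w_in_bwd, w_rec_bwd)[::-1]

# Every time step now carries context from both the past and the future.
bidirectional_states = np.concatenate([forward_states, backward_states], axis=1)
print(bidirectional_states.shape)
```

Concatenating the two directions doubles the feature width per time step, which is why a bidirectional layer with n units produces 2n outputs.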
The best way to describe attention is to look at a basic CNN use case: image classification (dogs vs. cats). If you are given an image of a dog, what is the most defining characteristic that helps you tell it apart? Is it the dog’s nose or its ears? The attention mechanism blurs certain pixels and focuses on only a portion of the image. Thus, it assigns weights that tell the model what to focus on.
In our context, the attention model takes into account the inputs from several time steps back and assigns a weight, signifying interest, to each of them. Attention is used to focus on specific words in the text sequence. For example, over a large dataset the attention layer will give more weight to words like “rain”, “pour”, “umbrella” and so on.
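As a rough sketch of the weighting idea, the snippet below scores each time step, turns the scores into weights with a softmax, and returns the weighted sum. The hidden states, the word labels in the comments and the `score_vector` are made-up assumptions; in a real model the scoring parameters are learned, and the attention layer used in the article may differ in detail.

```python
import numpy as np

def attention_pool(hidden_states, score_vector):
    """Weight each time step by a softmax score and return the weighted sum."""
    scores = hidden_states @ score_vector             # one raw score per time step
    scores = scores - scores.max()                    # stabilise the exponentials
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
    context = weights @ hidden_states                 # weighted sum over time steps
    return context, weights

# Hypothetical hidden states for a 4-word sequence with 3 features per step.
hidden_states = np.array([
    [0.1, 0.0, 0.2],   # "it"
    [0.0, 0.1, 0.1],   # "started"
    [0.9, 0.8, 0.7],   # "pour"
    [0.1, 0.2, 0.0],   # "today"
])
score_vector = np.array([1.0, 1.0, 1.0])  # would be learned during training

context, weights = attention_pool(hidden_states, score_vector)
print(weights.round(3))
```

The step standing in for “pour” receives the largest weight, so it dominates the pooled representation.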
We used Stochastic weight averaging to update the weights of our network for the following reasons-
The image below (Illustrations of SWA and SGD with a Pre-activation ResNet-164 on CIFAR-100) shows how well SWA generalizes and results in better performance. On the left we have W1, W2 and W3 as weights of three independently trained networks and Wswa is the average of the three. This holds even though on the right we see a greater train loss by SWA as compared to SGD.
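The averaging step itself is simple. Below is a minimal sketch of a running, equal-weight average over weight snapshots; the three weight vectors are invented for illustration and `swa_update` is a hypothetical helper, not a library function.

```python
import numpy as np

def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into the running average."""
    return (swa_weights * n_averaged + new_weights) / (n_averaged + 1)

# Hypothetical weight snapshots W1, W2 and W3.
w1 = np.array([0.2, 1.0, -0.5])
w2 = np.array([0.4, 0.8, -0.3])
w3 = np.array([0.3, 0.9, -0.1])

w_swa = w1.copy()
for n_averaged, snapshot in enumerate([w2, w3], start=1):
    w_swa = swa_update(w_swa, snapshot, n_averaged)

print(w_swa)  # equal-weight average of w1, w2 and w3
```

Keeping a running average means only one extra copy of the weights has to be stored, regardless of how many snapshots are averaged.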
The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries and the learning rate cyclically varies between these bounds based on a predefined function. It is possible that during gradient descent we are stuck at the local minimum but a cyclical learning rate can help jump out to a different location moving towards the global optimum.
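One common choice for the predefined function is a triangular wave. The sketch below assumes that policy; the article does not say which function was used, and the bounds and step size here are arbitrary.

```python
import math

def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over `step_size` steps,
    then falls back to base_lr over the next `step_size` steps, and repeats.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Hypothetical bounds and step size, one full cycle.
schedule = [triangular_lr(s, base_lr=0.001, max_lr=0.003, step_size=4) for s in range(9)]
print([round(lr, 4) for lr in schedule])
```

The rate starts at the lower bound, peaks at step 4 and returns to the lower bound at step 8.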
This is a technique used specifically to prevent overfitting. It involves dropping some percentage of the units during the training process. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. The units are selected randomly. Dropout randomly zeros out activations in order to force the network to generalize better, overfit less, and build in redundancies as a regularization technique.
There are different ways to drop values. Think of the pixels in an image: each pixel is highly correlated with its neighbours, so randomly dropping individual pixels will not accomplish much. That is where techniques like spatial dropout come into the picture. Spatial dropout involves dropping entire feature maps. For explanation purposes, consider an image cropped into smaller segments, each of which is mapped using a function. Out of all these mapped values we randomly delete some.
If we take textual data you can think of it as dropping entire phrases and forcing the model to generalize better.
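The difference between the two kinds of dropout can be sketched in NumPy. This assumes the behaviour of Keras-style 1D spatial dropout, where a whole feature channel is zeroed across every time step; the helper names and the input are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate):
    """Standard (inverted) dropout: zero individual values, rescale the survivors."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def spatial_dropout_1d(sequence, rate):
    """Spatial dropout for sequences: zero whole feature channels at once."""
    n_features = sequence.shape[1]
    keep = rng.random(n_features) >= rate     # one decision per channel
    return sequence * keep / (1.0 - rate)     # broadcasts over the time axis

# Hypothetical embedded sentence: 5 time steps x 8 embedding dimensions.
embedded = np.ones((5, 8))
dropped = spatial_dropout_1d(embedded, rate=0.25)
print(dropped)
```

Each column of `dropped` is either entirely zero or entirely rescaled, whereas standard dropout would scatter zeros across individual cells.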
Rather than leaving features on widely varying ranges, we often normalize the data set so that training converges faster. Batch normalization applies the same principle inside the network, to the inputs of the hidden layers. It reduces internal covariate shift, that is, the change in the distribution of a layer's inputs as the preceding weights are updated, which helps the network train faster and generalize better. That means that even if the values in the train set and test set are distributed quite differently, normalizing them can help reduce overfitting and get better results.
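A minimal sketch of the training-time computation follows. It normalizes each feature over the batch and then applies a scale and shift; the input values are invented, `gamma` and `beta` would be learned in a real layer, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature over the batch, then apply a scale and shift."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    normalised = (batch - mean) / np.sqrt(var + eps)
    return gamma * normalised + beta

# Hypothetical hidden-layer inputs: 4 samples x 2 features on very different scales.
hidden_inputs = np.array([[1.0, 200.0],
                          [2.0, 400.0],
                          [3.0, 600.0],
                          [4.0, 800.0]])
normalised = batch_norm(hidden_inputs)
print(normalised.round(3))
```

After normalization both features have roughly zero mean and unit variance, so neither dominates the next layer because of its scale.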
Global average and global max pooling reduce the spatial size of the feature map/ representation to one feature map for each category (classification task).
Let’s say that we have an image of a dog. Global average pooling will take the average of all the activation values and tell us about the overall strength of the image, i.e. whether the image is of a dog or not.
Global max pooling, on the other hand, will take the maximum of the activation values. This will help identify the strongest trait of the image, say, ears of the dog. Similarly, in the case of textual data, global max pooling highlights the phrase with the most information while global average pooling indicates the overall value of the sentence.
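For textual data, both operations collapse the time axis of a sequence of feature vectors. The sketch below uses a made-up 4 x 3 array in place of real recurrent-layer output.

```python
import numpy as np

# Hypothetical recurrent-layer output: 4 time steps x 3 features.
sequence_features = np.array([
    [0.1, 0.5, 0.0],
    [0.9, 0.4, 0.2],
    [0.2, 0.6, 0.1],
    [0.0, 0.5, 0.3],
])

global_avg = sequence_features.mean(axis=0)  # overall strength of each feature
global_max = sequence_features.max(axis=0)   # strongest response of each feature

print(global_avg, global_max)
```

The first feature fires strongly at a single step, so max pooling reports 0.9 for it while average pooling reports only 0.3.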
Let’s dive deeper into the choices made and connect the dots between understanding the components of a neural network to actually forming one. Well, one of the most important factors that comes into play while deciding on the architecture is past experience and experimentation.
We first started with a very basic model (GRU) and plotted the accuracy and loss. We noticed during our experimentation that given the nature of our dataset it was very easy to overfit. Observe the graph below obtained on using a bidirectional GRU with 64 units/neurons and a single hidden layer (16 units):
Clearly, we start to overfit very quickly and cannot keep the number of epochs high. It also suggests that a very complex model (one with multiple layers and several nodes) will be prone to overfitting, which underlines the importance of regularization techniques like dropout in the architecture.
We also carried out error analysis to further get an idea of the performance of the baseline model.
We intended to understand the following:
Which insincere questions does the network strongly believe to be sincere?
class 0 [0.94] Should we kick out Muslims from India?
class 0 [0.94] Why does Quora keep posting this leftist propaganda? Are they owned by a liberal media conglomerate?
class 0 [0.94] Why don’t the Christian and Muslim quarter be under Palestinian control and the Jewish quarter be under Israeli control?
Our baseline model has shown an F1 score of 0.63 yet these sentences are being tagged as sincere. What is the issue?
After a deep dive into the dataset, we noticed that it contains several mislabelled cases. The competition details specify that the labels were produced by a combination of manual and ML/rule-based efforts. This draws attention to a real-world scenario that can exist in any ML/DL project. Ideally, we would relabel such cases, as the model is currently learning from mislabelled examples as well. However, that cannot be done given the constraints of the competition.
Which sincere questions does the network strongly believe to be insincere?
class 1 [0.00000001] What astrological combinations are needed to obtain a scholarship for studies?
class 1 [0.00000001] How many karats needed for a karat farm?
class 1 [0.0000001] How many SITA devices are present in Ramayana?
Which insincere/sincere questions is the network most uncertain how to classify?
Keep in mind that our chosen threshold for converting probabilities into a binary label is 0.4.
class 0 [0.40] Why do most people not consider the US a two party system, even though any party besides Democrat/Republican realistically has no chance for a win?
class 0 [0.40] Feigning ignorance of who wrote it, could you have mistaken "Something" for Paul McCartney number if he had sung it instead of George?
class 1 [0.399] Why can't we make radiation death rays with huge spread and nearly infinite range to nullify Western air power once and for all and put a stop their invasions of our sovereignty?
class 1 [0.399] What country is more tolerant to gays, Saudi Arabia or USA?
Which insincere questions does the network classify correctly with strong confidence?
class 1 [0.99] Why are the Quora moderators f*cking ***holes?
class 1 [0.98] Why is it that Quora is so f*cking liberal and stupid?
This gives us a general idea of what is happening and helps us develop some intuition about where the model shines and where it goes wrong. In particular, we also get to see some words that might be clearly tipping the classification scales in a particular direction.
Let’s start with the input data. We used an average of the pre-trained GloVe and Paragram word embeddings. The reason for this choice was that GloVe and Paragram were found to cover more than 80% of our corpus. Further, taking an average of these embeddings gave better results on our baseline model.
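Averaging two embedding tables can be sketched as follows. The tiny 3-dimensional vectors and the vocabulary are invented (real pretrained vectors are much longer), and treating a word that is missing from one table as a zero vector is an assumption of this sketch, not something the article states.

```python
import numpy as np

# Hypothetical toy embedding tables standing in for GloVe and Paragram.
glove = {"rain": np.array([0.2, 0.4, 0.6]), "umbrella": np.array([0.1, 0.3, 0.5])}
paragram = {"rain": np.array([0.4, 0.2, 0.0]), "pour": np.array([0.6, 0.6, 0.6])}

vocabulary = ["rain", "umbrella", "pour", "xyzzy"]
embedding_dim = 3

def lookup(table, word):
    """Return the word's vector, or zeros when the word is out of vocabulary."""
    return table.get(word, np.zeros(embedding_dim))

# One row per vocabulary word: the element-wise mean of the two embeddings.
embedding_matrix = np.stack([
    (lookup(glove, word) + lookup(paragram, word)) / 2.0 for word in vocabulary
])

# Share of the vocabulary found in at least one of the tables.
coverage = np.mean([word in glove or word in paragram for word in vocabulary])
print(embedding_matrix, coverage)
```

A coverage figure like this is the kind of check that tells you how much of the corpus the pretrained vectors actually account for.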
Spatial dropout is used immediately after the embeddings. This makes our model more robust during training and helps prevent overfitting.
Following which we have used a bidirectional LSTM layer, with 40 units. We decided to split this model into two pathways:
The best way to look at this is as two branches: the former (LSTM – Attention) maintains simplicity, and the latter allows the model to learn more by having an additional layer. The choice of 40 units is mostly based on experimentation and intuition; we noticed that with a larger number of units the model started to overfit almost immediately.
In the latter branch, the output of the 2nd bidirectional LSTM layer is used for three operations, namely an attention layer, global average pooling and global max pooling. Each of these layers brings forth different features from the data and outputs 80 units.
All these outputs (from both the former and latter branches) are concatenated (concatenating the 4 outputs gives 320 units) and fed into a layer of 32 units with a ReLU activation. This is followed by batch normalization and dropout to speed up the computation and help reduce overfitting. After this, we have the final output layer with a sigmoid activation function.
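The unit counts quoted above follow from simple arithmetic, sketched below. It assumes that both bidirectional LSTM layers use 40 units per direction and that each of the four concatenated outputs has the width of a bidirectional layer; the dictionary keys are just descriptive labels.

```python
lstm_units = 40
bidirectional_width = 2 * lstm_units  # forward and backward states are concatenated

# The four outputs that get concatenated, as described in the text.
branch_widths = {
    "attention on 1st BiLSTM": bidirectional_width,
    "attention on 2nd BiLSTM": bidirectional_width,
    "global average pooling on 2nd BiLSTM": bidirectional_width,
    "global max pooling on 2nd BiLSTM": bidirectional_width,
}
concatenated_width = sum(branch_widths.values())

dense_units = 32
# Weights plus one bias per unit in the 32-unit dense layer.
dense_parameters = concatenated_width * dense_units + dense_units

print(bidirectional_width, concatenated_width, dense_parameters)
```

This is where the 80 units per output and the 320 concatenated units come from.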
Kaggle had set the evaluation metric to be the F1 score. This was a more suitable choice than accuracy because of the class imbalance present in the dataset. Moreover, because some of the questions are labelled incorrectly, techniques used to handle class imbalance, such as undersampling and oversampling, might actually increase the number of incorrectly labelled questions or decrease the number of correctly labelled ones. Computational constraints were another important factor to keep in mind while making any decision.
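A small worked example shows why accuracy is misleading under class imbalance. The labels below are synthetic, with an assumed 6% positive rate chosen only for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Synthetic labels: 6 insincere questions (1) out of 100.
y_true = [1] * 6 + [0] * 94

# A model that labels everything as sincere.
always_sincere = [0] * 100
# A model that finds 4 of the 6 insincere questions but raises 4 false alarms.
some_signal = [1, 1, 1, 1, 0, 0] + [1] * 4 + [0] * 90

acc_naive, f1_naive = accuracy_and_f1(y_true, always_sincere)
acc_signal, f1_signal = accuracy_and_f1(y_true, some_signal)
print(acc_naive, f1_naive)
print(acc_signal, round(f1_signal, 3))
```

Both models reach the same accuracy, yet only the second one has a non-zero F1 score, which is why F1 separates them and accuracy does not.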
Our approach to model validation included creating a train and a validation data set. We ran 10 epochs, which is the maximum we could run with this model; beyond this we would start overfitting or violate the computational constraints. We started SWA from the 4th epoch because at that point the F1 score had already reached close to 0.65. Thus, it was a good point to start the process.
Once we had the final predictions from the model we used a threshold to binarize the probabilities which was obtained on the basis of the validation dataset.
Kaggle’s scoring process used only 15% of the data for the public leaderboard and the remainder for the private leaderboard. Our final model returned a score of 0.68 on the public leaderboard and around 0.68875 on the private leaderboard. This stability in the score indicates that the model generalizes well.
Here is a look at the confusion matrix:
The traditional ML approach yielded a score of 0.583 as compared to the deep learning model’s score of 0.68.
While the deep learning model clearly outperformed the traditional ML stacking, there are a few points to consider before you set the course of your text classification problem:
Purpose:
Natural language processing (NLP) has become widely popular. With the large amount of data available (in emails, web pages, SMS), it has become important to extract valuable information from textual data, and an assortment of machine learning techniques has been designed to accomplish this task. With current advances in deep learning, we felt it would be an interesting idea to compare traditional and deep learning techniques. We picked a playground Kaggle data set for text classification and implemented both types of algorithms for comparison.
In today’s world, websites have to deal with toxic and divisive content. This is especially true of major websites like Quora, which cater to large traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.
A question is classified as insincere if:
For more information regarding the challenge you can use the following link.
The full code is available here.
In this article we will tackle text classification by using machine learning and NLP techniques. For any data science problem with textual data the common steps include:
Let’s explore them step by step in more detail.
This is one of the most important steps of any project: you need to familiarize yourself with the data before implementing any modeling technique.
    import os

    # List the files in the input directory.
    print(os.listdir("../input"))
Our dataset includes:
We are not allowed to use any external data sources. The following embeddings are given to us which can be used for building our models.
A look at the size of our train and test data:
In the target variable 1 represents the class Insincere and 0 the Sincere class of questions.
Let’s explore the distribution of the target variable:
    import numpy as np
    import seaborn as sns
    import plotly.offline as py
    import plotly.graph_objs as go

    # Notebook magic: render matplotlib figures inline.
    %matplotlib inline
    color = sns.color_palette()
    py.init_notebook_mode(connected=True)

    ## target distribution ##
    cnt_srs = train_df['target'].value_counts()
    labels = np.array(cnt_srs.index)
    sizes = np.array((cnt_srs / cnt_srs.sum()) * 100)  # share of each class, in percent

    trace = go.Pie(labels=labels, values=sizes)
    layout = go.Layout(
        title='Target distribution',
        font=dict(size=18),
        width=600,
        height=600,
    )
    data = [trace]
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename="usertype")
These box plots shared below will help understand if there are any patterns in the dataset regarding the word count or the number of characters.
Insincere questions have more words per question
Insincere questions have more characters than sincere questions
Sincere questions have fewer punctuation marks
More upper case words in sincere questions
For questions classified as sincere we see general words like “will”, “one” and so on. We also see the word “will” prevalent in insincere questions. During the data processing steps we will have to treat such common words. Another point brought out by the word cloud is that words like “Trump” and “liberal” are very specific to insincere questions, possibly because the person is making a statement about these topics rather than genuinely looking for an answer.
Unstructured text data is usually dirty; that is, it has misspelled words, inconsistent casing and various other issues. We need to clean the text and bring it to a standardized form before extracting information from it, as without this step the noise will result in a poor model.
Broadly, consider the following steps:
Tokenization refers to the splitting of strings of text into smaller chunks or tokens. Paragraphs or large bodies of text are tokenized into sentences and then sentences are broken down into individual words.
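The two levels of splitting can be sketched with regular expressions from the standard library. These helpers are rough, hypothetical stand-ins and not NLTK's `sent_tokenize` or `word_tokenize`, which handle abbreviations, contractions and other edge cases far better.

```python
import re

def sent_tokenize_simple(text):
    """Split on whitespace that follows sentence-ending punctuation (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize_simple(sentence):
    """Keep runs of letters, digits and apostrophes as tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "It started to pour. Luckily, my friend was carrying an umbrella!"
sentences = sent_tokenize_simple(text)
tokens = [word_tokenize_simple(s) for s in sentences]

print(sentences)
print(tokens)
```

The text is first broken into two sentences, and each sentence is then broken into its words.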
This refers to a series of steps that transforms the corpus of text into a single standard and consistent form. The following steps are a part of this process:
Stemming involves chopping off the end of a word or its inflectional endings (-ing, -ed etc.) to get its root form, or stem, using crude heuristic rules.
burning -> burn.
Stemming generally works well most of the time, but can often return words which might not look correct intuitively.
difficulties -> difficulti
Lemmatization has the same goal as stemming. However, it uses a vocabulary and the morphological analysis of words, to remove inflectional endings and return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization aims to reduce the word properly so that it makes sense according to the language.
ran -> run, difficulties -> difficulty
The idea for stemming or lemmatization of words is to reduce words into a common form. For example, difficulties and difficulty will portray the same intent and context.
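The contrast between the two approaches can be shown with a toy example. The suffix rules and the lemma table below are hand-written assumptions for illustration only; they are not the Porter stemmer or the WordNet lemmatizer.

```python
def toy_stem(word):
    """Strip a few common suffixes with crude rules (a toy, not the Porter algorithm)."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")):
        # Only strip when a reasonably long stem is left behind.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# A lemmatizer consults a vocabulary; this tiny hand-written table stands in for one.
TOY_LEMMAS = {"ran": "run", "difficulties": "difficulty", "burning": "burn"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["burning", "difficulties", "ran"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer cannot do anything with “ran” and leaves “difficulti” behind, while the vocabulary lookup returns proper dictionary forms.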
For our use case we have performed the following operations to clean the data (using the library NLTK):
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Lower case
    all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(w.lower() for w in x.split()))

    # Remove punctuation (recent pandas versions need regex=True for pattern replacement)
    all_data['question_text'] = all_data['question_text'].str.replace(r'[^\w\s]', '', regex=True)

    # Remove numbers
    all_data['question_text'] = all_data['question_text'].str.replace(r'[0-9]', '', regex=True)

    # Remove stop words and words with length <= 2
    stop = set(stopwords.words('english'))
    all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop and len(w) > 2))

    # Lemmatize, treating every word as a verb
    wl = WordNetLemmatizer()
    all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(wl.lemmatize(w, 'v') for w in x.split()))
This part is what makes the difference between a good and a bad solution in any ML project. So what features can we create in our use case? We can start with understanding the sentiment.
Sentiment analysis is a part of opinion mining and involves building a system to extract the opinion from a text. That is, we wish to get a score that tells us how positive or negative the text is.
Our assumption about the data set was that questions flagged as insincere may contain toxic content and would exhibit a negative sentiment. However, as far as modeling features are concerned, sentiment turned out to be a weak feature. On deeper evaluation, we noticed that there were several questions with high polarity scores under both the insincere and sincere tags.
Topic modelling is an approach to identify topics present across a corpus of text. A topic is defined as a repeating pattern of co-occurring terms in a corpus. A document contains multiple topics in varying proportions. So, for example, a document based on healthcare is more likely to contain a higher ratio of words like “doctor” and “surgery” than words such as “brakes” and “gear”, which indicate a theme of automobiles.
Using a technique like Latent Dirichlet Allocation to get the distribution of topics across the corpus would potentially help to get a sense of the themes discussed in the set of questions. Further, we hypothesize that there would be some difference between the topics of sincere and insincere questions.
The image below shows the distribution of the topics with respect to the different classes by taking an average.
Looking up top words from top topics from class insincere:
Looking up top words from top topics from class sincere:
CountVectorizer returns a matrix which shows the frequency of each term in the vocabulary per document. On the other hand, tf-idf (term frequency-inverse document frequency) evaluates how important a word is to a document in the corpus.
tf(x) = (Number of times term x appears in a document) / (Total number of terms in the document)
idf(x) = log(Total number of documents / Number of documents with term x in it)
tf-idf(x) = tf(x) * idf(x)
Clearly, the importance of a word in a document increases in proportion to the number of times it appears there, but it is offset by the number of documents in the corpus that contain it.
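The three formulas translate directly into a few lines of Python. The corpus below is a made-up toy, and this is the plain textbook definition; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math

def tf(term, document):
    """Term frequency: share of the document's tokens that equal `term`."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    n_containing = sum(term in document for document in corpus)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# Hypothetical tokenized questions.
corpus = [
    ["why", "is", "the", "sky", "blue"],
    ["why", "is", "the", "ocean", "blue"],
    ["how", "is", "rain", "formed"],
]

score_sky = tf_idf("sky", corpus[0], corpus)  # appears in one document only
score_is = tf_idf("is", corpus[0], corpus)    # appears in every document
print(round(score_sky, 4), score_is)
```

A word that occurs in every document gets a score of zero, while a word unique to one document is weighted up.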
Both tf-idf and countvectorizer, as features, may indicate the relevance of a certain set of words to questions labelled as “Sincere” as well as “Insincere”.
The image below is obtained by using a TF-IDF vectorizer to create features and a k-fold CV logistic regression model and it shows the words (of insincere questions) with most weight.
The idea behind building features such as the number of unique words, characters or exclamation points is to check for uniformity in the data set. We wish to observe if there are some similarities between the train and test set. Some questions that these meta features help answer include:
The examples mentioned above give us the idea that there might be certain patterns specific to the respective classes that can be leveraged in our model. To give an ad hoc example of how useful meta features can be, on a musical note, the number of words per minute for Eminem is different based on the content/emotion of the song.
Some of the meta features are listed below:
We have added some box plots in the data exploration section to provide you with an idea regarding the prevalent distribution with respect to the different classes.
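Meta features of this kind need nothing more than string operations. The feature names below are hypothetical and chosen to mirror the quantities discussed above (word counts, characters, punctuation, upper case words); they are not necessarily the exact set used in the project.

```python
import string

def meta_features(question):
    """Compute simple count-based descriptors of a question."""
    words = question.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(w.lower() for w in words)),
        "num_chars": len(question),
        "num_punctuation": sum(ch in string.punctuation for ch in question),
        "num_upper_words": sum(w.isupper() for w in words),
        "num_exclamations": question.count("!"),
    }

features = meta_features("Why oh why is the sky BLUE?!")
print(features)
```

Applied to every row of the train and test sets, a function like this yields the columns that the box plots in the data exploration section summarize.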
So far we have cleaned up our text and carried out feature engineering. There are several ways to select the relevant features; however, for the purpose of this article we decided to build a separate model for each set of features, as this helps develop a general understanding and makes it easier to reuse these tactics on other text classification datasets.
A few things to note:
There are two pieces of code that will be reused in most of the models:
    import time
    import numpy as np
    from tqdm import tqdm
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    start_time = time.time()
    n_splits = 5
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=43)

    ## Initialize 0's
    test_pred_ots = 0
    oof_pred_ots = np.zeros([train.shape[0],])
    train_target = train['target'].values
    x_test = test[selected_features].values

    ## Loop to split the data set
    for i, (train_index, val_index) in tqdm(enumerate(kf.split(train))):
        # KFold yields positional indices, so select rows with iloc
        x_train = train.iloc[train_index][selected_features].values
        x_val = train.iloc[val_index][selected_features].values
        y_train, y_val = train_target[train_index], train_target[val_index]

        # Model
        classifier = LogisticRegression(C=0.1)
        classifier.fit(x_train, y_train)

        ## Validation set predictions
        val_preds = classifier.predict_proba(x_val)[:, 1]

        ## Test set predictions, averaged over the folds
        preds = classifier.predict_proba(x_test)[:, 1]
        test_pred_ots += preds / n_splits
        oof_pred_ots[val_index] = val_preds

    print("--- %s seconds for Model Selected Features ---" % (time.time() - start_time))
The code above runs 5-fold cross validation; with each split we train the model and make predictions on the validation and test datasets. At the end of all splits we get oof_pred_ots, which holds the predictions on the validation folds combined into a single array. We also get the test-set prediction probabilities averaged over the splits in test_pred_ots.
    import numpy as np
    from sklearn import metrics

    thresh_opt_ots = 0.5
    f1_opt = 0
    for thresh in np.arange(0.1, 0.91, 0.01):
        thresh = np.round(thresh, 2)
        f1 = metrics.f1_score(train_target, (oof_pred_ots.astype(float) > thresh).astype(int))
        # print("F1 score at threshold {0} is {1}".format(thresh, f1))
        if f1_opt < f1:
            f1_opt = f1
            thresh_opt_ots = thresh
    print(thresh_opt_ots)

    # np.int has been removed from recent NumPy versions; the built-in int works instead
    pred_train_ots = (oof_pred_ots > thresh_opt_ots).astype(int)
    metrics.f1_score(train_target, pred_train_ots)
The code above scans thresholds from 0.1 to 0.9 and keeps the one that gives the highest F1 score on the out-of-fold predictions.
First Model:
We used the text descriptive features and ran a 5-fold cross validation logistic regression model however, the F1 score is not that significant (0.27).
Second Model:
We used the sentiment and topic modeling features and ran the same model as mentioned before. This time we got a better score (0.34).
Third Model:
We used TF-IDF features and tried logistic regression (F1 = 0.587) and LightGBM (F1 = 0.591). This is much better.
Fourth Model:
We used CountVectorizer features and tried logistic regression (F1 = 0.592) as well as multinomial (F1 = 0.55) and Bernoulli (F1 = 0.53) naive Bayes models.
The idea here is that one model might be observing patterns that the other isn’t. Further, an ensemble helps get better results while reducing the chance of over-fitting. We used stacking, which means that we make out-of-fold predictions on the entire train set. This is accomplished by splitting the data at each fold into a train and a holdout set and making predictions on the holdout set. The splits are chosen such that there is a prediction for each row in the train data set.
We use these new predictions from the respective models as input variables and run another (logistic regression) model on top of this giving us the final probabilities.
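The stacking procedure described above can be sketched roughly as follows. This is a minimal illustration and not the article's actual pipeline: the data is synthetic (`make_classification`), the two base models are stand-ins, and names such as `stack_features` and `meta_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (assumption): the article's features are not reproduced here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=43)
base_models = [LogisticRegression(C=0.1, max_iter=1000), BernoulliNB()]

# Level 1: one out-of-fold probability column per base model, so every
# training row is scored by a model that never saw it during fitting.
oof_columns = [
    cross_val_predict(model, X, y, cv=kf, method="predict_proba")[:, 1]
    for model in base_models
]
stack_features = np.column_stack(oof_columns)

# Level 2: a logistic regression on top of the base-model predictions.
meta_model = LogisticRegression()
meta_model.fit(stack_features, y)
final_probs = meta_model.predict_proba(stack_features)[:, 1]
```

Building the second-level inputs from out-of-fold predictions is what keeps the meta-model from simply learning which base model memorized the training rows.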
Our final F1 score is 0.604, and 0.589 on the leader-board.
In the next article we will implement a deep learning approach to the same use case and draw comparisons between the two methodologies.
Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work.
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
We have already covered some of the basics of the architecture and the respective components in the previous posts. But we need to understand one of the most important concepts.
How do Neural networks exactly work?
How are the weights updated in Neural networks?
Well, let’s get into the algorithms behind Neural Networks.
For most machine learning algorithms, optimization is used to minimize the cost/error function. Gradient Descent is one of the most popular optimization algorithms used in Machine Learning. There are many powerful ML algorithms that use gradient descent such as linear regression, logistic regression, support vector machine (SVM) and neural networks.
Intuition
Let’s take the classic mountain valley example with a twist: in your travels you meet a pirate and discover a map to the golden chalice of wisdom. The secret location is the lowest point in a very dark and deep valley. Given that there are no possible sources of natural or artificial light in this magical valley, both the pirate and you are in a race to reach the bottom of the valley in pitch darkness. The pirate decides to take steps forward randomly with the hope of eventually reaching the lowest point.
Both of you have the same starting point, but you think there must be a smarter way. At every step you decide to feel the gradient (slope) around you and take the steepest downhill step possible. By taking the best possible step every time, you win!
That is analogous to the gradient descent technique. We are operating blind, trying to take each step in the most promising direction.
Let us say that we fit a regression model on our dataset. We need a cost function to minimize the error between our prediction and the actual value. The plot of our cost function will look like:
Gradient is another word for slope and the first step in gradient descent is to pick a starting value at random or set it to 0. Now, a gradient has the following characteristics:
Let’s take a mathematical function to further understand the same.
In mathematical terms, if our function is:
$
f(x) = e^{2}\sin(x)
$
The derivative:
$
\frac {\partial f}{\partial x} = e^2\cos(x)
$
If x = 0
$
\frac{\partial f}{\partial x} (0) = e^2 \approx 7.4
$
So when you start at 0 and move a little (take a step), the function changes by about 7.4 times (magnitude) the amount that you changed. Similarly, if you have multiple variables we take partial derivatives:
$
z = f(x,y) = xy + x^2
$
For a function such as the one above, we first treat y as a constant and differentiate with respect to x (here: y + 2x). Then we treat x as a constant and differentiate with respect to y (here: x). For example, at x = 3 and y = -3 we have f(x,y) = 0, with partial derivatives y + 2x = 3 and x = 3. Collecting the partial derivatives into a vector gives the gradient:
$
\nabla f = (y + 2x, x)
$
The gradient points in the direction of greatest increase of the function, so gradient descent steps in the opposite direction.
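As a quick sanity check, the derivatives of the two example functions can be approximated numerically. This is a small sketch using central finite differences; the helper name `central_diff` and the step size `h` are illustrative choices, not part of the original text.

```python
import math

def f(x):
    # f(x) = e^2 * sin(x), the single-variable example
    return math.exp(2) * math.sin(x)

def g(x, y):
    # z = xy + x^2, the two-variable example
    return x * y + x ** 2

def central_diff(func, point, index, h=1e-6):
    # Approximate the partial derivative of func with respect to
    # argument `index` using a central finite difference.
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)

df_dx = central_diff(f, [0.0], 0)        # should be close to e^2
dg_dx = central_diff(g, [3.0, -3.0], 0)  # should be close to y + 2x = 3
dg_dy = central_diff(g, [3.0, -3.0], 1)  # should be close to x = 3
print(df_dx, dg_dx, dg_dy)
```

Comparing an analytic derivative against a finite difference like this is a common way to catch sign or algebra mistakes before they reach a training loop.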
In a feed-forward network, we are learning how the error varies as a weight is adjusted. The relationship between the net’s error and a single weight will look something like the image below (we will get into more detail a little later):
As a neural network learns, it slowly adjusts its weights by calculating dE/dw, the derivative of the network error with respect to each weight.
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example:
Metric | Value |
---|---|
Gradient Magnitude | 2.5 |
Learning Rate | 0.01 |
Then the gradient descent algorithm will pick the next point 0.025 away from the previous point (2.5 * 0.01). With a small learning rate the algorithm takes too long to converge, and with a very large learning rate it might diverge away from the minimum point (miss the minimum completely).
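The effect of the learning rate can be seen in a few lines of code. The sketch below assumes a hypothetical cost J(x) = x^2 (gradient 2x, minimum at x = 0) and starts at x = 1.25 so that the first gradient is 2.5, as in the table; the function name `gradient_descent` and the three learning rates are illustrative.

```python
def grad(x):
    # Gradient of the hypothetical cost J(x) = x^2
    return 2 * x

def gradient_descent(start, learning_rate, steps):
    # Repeatedly step against the gradient: x <- x - learning_rate * grad(x)
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * grad(x)
        path.append(x)
    return path

first_step = gradient_descent(start=1.25, learning_rate=0.01, steps=1)
small_lr = gradient_descent(start=1.25, learning_rate=0.01, steps=50)  # slow progress
good_lr = gradient_descent(start=1.25, learning_rate=0.3, steps=50)    # converges quickly
large_lr = gradient_descent(start=1.25, learning_rate=1.1, steps=50)   # overshoots and diverges

print(first_step[0] - first_step[1])        # size of the first step
print(small_lr[-1], good_lr[-1], large_lr[-1])
```

With the small rate the point is still far from the minimum after 50 steps, while the oversized rate makes every step overshoot by more than the previous one.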
Finally, the weights are updated incrementally after each epoch (pass over the training dataset) till we get the best results.
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we have assumed that the batch has been the entire data set. But for large datasets, the gradient computation might be expensive.
Stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the full gradient ∇f(x), stochastic gradient descent samples an index i uniformly at random and computes ∇fi(x) instead.
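To make the contrast concrete, here is a minimal sketch that fits a single weight with both approaches. The data (y = 3x plus noise), the learning rates and the iteration counts are all made up for illustration.

```python
import random

random.seed(0)
# Hypothetical data: y = 3x plus a little noise
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + random.gauss(0, 0.1) for x in xs]

def grad_single(w, i):
    # Gradient of the per-example loss f_i(w) = 0.5 * (w * x_i - y_i)^2
    return (w * xs[i] - ys[i]) * xs[i]

def full_gradient(w):
    # Batch gradient: the average of all per-example gradients
    return sum(grad_single(w, i) for i in range(len(xs))) / len(xs)

# Stochastic gradient descent: one randomly sampled example per update
w_sgd = 0.0
for _ in range(2000):
    i = random.randrange(len(xs))  # sample one index uniformly at random
    w_sgd -= 0.1 * grad_single(w_sgd, i)

# Batch gradient descent: every update touches the whole dataset
w_batch = 0.0
for _ in range(200):
    w_batch -= 0.5 * full_gradient(w_batch)

print(w_sgd, w_batch)
```

Each stochastic update costs one example instead of 200, at the price of a noisier path toward the same neighbourhood of the solution.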
Back-propagation is simply a technique or method of updating the weights. We are aware of partial derivatives, the chain rule and, most importantly, gradient descent. But neural networks have multiple layers and different activation functions, which makes it difficult to visualize how everything comes together. Consider a simple example with the following architecture:
Step 1: Initialization
Let us initialize the weights and the bias.
Table 1 a: Weight Initialization Example
Weights | Value |
---|---|
w1 | 0.10 |
w2 | 0.15 |
w3 | 0.03 |
w4 | 0.08 |
w5 | 0.18 |
w6 | 0.06 |
w7 | 0.11 |
w8 | 0.26 |
Table 1 b: Bias Initialization Example
Bias | Value |
---|---|
b1 | 0.05 |
b2 | 0.42 |
Assume the initial input values are [0.95, 0.06] and the target values are [0.05, 0.82].
Step 2: Calculations
To get the value of H1:
H1 = w1 * x1 + w2 * x2 + b1 = 0.1 * 0.95 + 0.15 * 0.06 + 0.05 = 0.154
As we have a sigmoid activation function:
$
\sigma(X) = \frac{1}{1+e^{-X}}
$
$
H1 = \frac{1}{1+e^{-H1}} = \frac{1}{1+e^{-0.154}} = 0.538
$
Similarly, we can calculate H2.
H1 = 0.538 and H2 = 0.52
Now we calculate the value for output nodes Y1 and Y2.
Y1 = w5 * H1 + w6 * H2 + b2 = 0.18 * 0.538 + 0.06 * 0.52 + 0.42 = 0.548
$
Y1 = \frac{1}{1+e^{-Y1}} = \frac{1}{1+e^{-0.548}} = 0.633
$
Upon calculation:
Y1 = 0.633 & Y2 = 0.648
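The forward pass above can be reproduced in a few lines. This sketch assumes the wiring implied by the calculations (w1, w2 feed H1; w3, w4 feed H2; w5, w6 feed Y1; w7, w8 feed Y2); the variable names `net_*` and `out_*` are only used here to separate a neuron's weighted sum from its activated output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the initialization tables
w1, w2, w3, w4 = 0.10, 0.15, 0.03, 0.08
w5, w6, w7, w8 = 0.18, 0.06, 0.11, 0.26
b1, b2 = 0.05, 0.42
x1, x2 = 0.95, 0.06

# Hidden layer (wiring is an assumption inferred from the H1 calculation)
net_h1 = w1 * x1 + w2 * x2 + b1
net_h2 = w3 * x1 + w4 * x2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# Output layer
net_y1 = w5 * out_h1 + w6 * out_h2 + b2
net_y2 = w7 * out_h1 + w8 * out_h2 + b2
out_y1, out_y2 = sigmoid(net_y1), sigmoid(net_y2)

print(out_h1, out_h2, out_y1, out_y2)
```

Small differences in the last decimal places compared with the hand calculation come from rounding the intermediate values.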
Step 3: Error Function
Let the error function be:
$
J( \theta ) = \frac{1}{2} ( target - output )^2
$
E1 = 0.5 * (0.05 - 0.63368)^2 ≈ 0.1703
E2 = 0.5 * (0.82 - 0.64893)^2 ≈ 0.0146
Total Error (E) = E1 + E2 ≈ 0.1850
Back-propagate the Errors to update the weights.
Error at W5:
$
\frac{\partial E}{\partial W5} = \frac{\partial E}{\partial out_{Y1}} * \frac{\partial out_{Y1}}{\partial Y1} * \frac{\partial Y1}{\partial W5}
$
Component 1: The Cost/Error Function
target: T, output: out
E = 0.5 * (T1 - out Y1)^2 + 0.5 * (T2 - out Y2)^2
Differentiating with respect to out Y1: - (T1 - out Y1) = - (0.05 - 0.63368) = 0.58368
Component 2: The Activation function
out Y1 = 1/(1 + exp(-Y1))
Differentiating with respect to Y1: out Y1 * (1 - out Y1) = 0.63368 * (1 - 0.63368) = 0.23213
Component 3: The Function of Weights
Y1 = w5 * H1 + w6 * H2 + b2
Differentiating with respect to w5: H1 * 1 = 0.538
Finally, we have the change in W5:
$
\frac{\partial E}{\partial W5} = 0.58368 * 0.23213 * 0.538 = 0.07289
$
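The three components can be multiplied in code and cross-checked against a numerical derivative. The sketch below uses the rounded hidden outputs from the worked example (0.538 and 0.52); `total_error` and the finite-difference step `h` are illustrative helpers, not part of the original derivation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hidden outputs and parameters taken from the worked example (rounded)
out_h1, out_h2 = 0.538, 0.52
w6, w7, w8, b2 = 0.06, 0.11, 0.26, 0.42
t1, t2 = 0.05, 0.82

def total_error(w5):
    # Total error E as a function of W5 alone, everything else held fixed
    out_y1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
    out_y2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
    return 0.5 * (t1 - out_y1) ** 2 + 0.5 * (t2 - out_y2) ** 2

w5 = 0.18
out_y1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)

# Chain rule: dE/dW5 = dE/d(out Y1) * d(out Y1)/dY1 * dY1/dW5
analytic = -(t1 - out_y1) * out_y1 * (1 - out_y1) * out_h1

# Independent check with a central finite difference
h = 1e-6
numeric = (total_error(w5 + h) - total_error(w5 - h)) / (2 * h)

print(analytic, numeric)
```

Both values agree to several decimal places, which confirms that the three chain-rule components were combined correctly.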
In order to update W5, recall the discussion on gradient descent: we step against the gradient. Let alpha be the learning rate with a chosen value of 0.01.
Updated W5 will be:
$
W5 - \alpha * \frac{\partial E}{\partial W5}
$
= 0.18 - 0.01 * 0.07289
= 0.1792711
Similarly, we can update the remaining weights. Let’s have a look at the formula to update W1:
$
\frac{\partial E}{\partial w_{1}} = (\sum\limits_{i}{\frac{\partial E}{\partial out_{i}} * \frac{\partial out_{i}}{\partial Y_{i}} * \frac{\partial Y_{i}}{\partial out_{h1}}}) * \frac{\partial out_{h1}}{\partial H1} * \frac{\partial H1}{\partial w_{1}}
$
It feels complicated, but really we are going back layer by layer to get the respective values. w1 feeds into neuron H1, and H1 is connected to both Y1 and Y2. Moving backwards, we first differentiate the error function, then the activation function and the function of weights of Y1 and Y2. That leads us to H1, where we differentiate its activation function and its function of weights.
This is how we back-propagate the errors and update all the weights. Once we have updated all the weights, that is one epoch, or pass over the dataset. We then start the entire process of forward pass and backward pass again. This process is repeated multiple times with the purpose of minimizing the error.
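Putting the forward pass, the backward pass and the weight update together gives a complete training loop for the small example network. This is a minimal sketch under a few assumptions: the biases are kept fixed, the learning rate `alpha = 0.5` is chosen only so that progress is visible within 1000 passes, and the list layout of `W_hidden` and `W_out` is just one convenient way to store w1 to w8.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.95, 0.06]
target = [0.05, 0.82]
# W_hidden[j][i]: weight from input i to hidden neuron j (w1..w4)
W_hidden = [[0.10, 0.15], [0.03, 0.08]]
# W_out[k][j]: weight from hidden neuron j to output neuron k (w5..w8)
W_out = [[0.18, 0.06], [0.11, 0.26]]
b1, b2 = 0.05, 0.42  # biases are kept fixed in this sketch
alpha = 0.5          # illustrative learning rate (assumption)

def forward():
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b1) for row in W_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b2) for row in W_out]
    return hidden, output

def error(output):
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

errors = []
for epoch in range(1000):
    hidden, output = forward()
    errors.append(error(output))
    # Output deltas: dE/d(out) * d(out)/d(net) for each output neuron
    delta_out = [-(t - o) * o * (1 - o) for t, o in zip(target, output)]
    # Hidden deltas: sum the error flowing back from every output neuron
    delta_hidden = [
        sum(delta_out[k] * W_out[k][j] for k in range(2)) * hidden[j] * (1 - hidden[j])
        for j in range(2)
    ]
    # Gradient descent step: subtract alpha times each gradient
    for k in range(2):
        for j in range(2):
            W_out[k][j] -= alpha * delta_out[k] * hidden[j]
    for j in range(2):
        for i in range(2):
            W_hidden[j][i] -= alpha * delta_hidden[j] * x[i]

print(errors[0], errors[-1])
```

The hidden deltas are computed before the output weights are changed, so every gradient in a pass refers to the same set of weights.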
When do we stop?
We stop prior to over-fitting. That is, we want the minimum validation error, so we stop once the validation error no longer improves even though the training error keeps falling.
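One common way to turn this stopping rule into code is early stopping with a patience window. The sketch below is an illustration only: the function name `early_stopping_epoch`, the `patience` value and the validation curve are all hypothetical.

```python
def early_stopping_epoch(val_errors, patience=3):
    # Return the index of the best epoch, stopping once the validation
    # error has failed to improve for `patience` consecutive epochs.
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Hypothetical validation curve: improves, then starts to over-fit
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.30]
stop_at = early_stopping_epoch(val_errors)
print(stop_at)
```

Training halts after three epochs without improvement, so the later value in the list is never reached and the weights from the best epoch would be the ones to keep.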
Hopefully, this explains the entire process of how neural networks actually work and sheds some light on gradient descent and back-propagation.
Activation: We have talked about activation functions in the past posts, but let’s understand in more detail the different types of activation functions and explore their characteristics.
What are Neural Networks made of? Understanding the different components and the architecture of Neural Networks.
The introduction to neural networks and a general idea behind the inspiration for such an algorithm has been discussed in the previous post. We will talk about the building blocks of neural networks in detail in future posts, but in this post we focus on the overall structure of Neural networks and discuss some of the components.
Now, let’s briefly discuss the elements of a neural network.
Neural networks are a set of algorithms, inspired by the working of the human brain. These algorithms are designed to recognize patterns. Neural networks consist of layers which are made of nodes. These nodes are where all the calculations happen.
Each input has its own relative weight. Weights are adaptive coefficients that determine the intensity of the input signal as registered by the artificial neuron. Using techniques like back-propagation discussed here, the weights are updated with each iteration in order to reduce the error. For now, all we need to know is that the weights will be updated using special algorithms and that these algorithms require differentiation. So the weights are updated over time, but when we start training a neural network the question is: how do we initialize the weights?
He-et-al Initialization: In this method, the weights are initialized keeping in mind the size of the previous layer, that is, the number of neurons in the previous layer. This helps gradient descent converge on a minimum of the cost function faster. The weights are still random but differ in range, so the initialization is more controlled. More details about this technique are available here.
There are several techniques which can be used for initialization, but the techniques mentioned here will give you some idea of how the weights, as a component, fit into neural networks.
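As a rough illustration of He-et-al initialization, the weights of a layer can be drawn from a zero-mean Gaussian whose standard deviation is sqrt(2 / n_in), where n_in is the size of the previous layer. The function name `he_init` and the layer sizes below are hypothetical.

```python
import math
import random

def he_init(n_in, n_out, seed=0):
    # He-et-al initialization: zero-mean Gaussian weights with standard
    # deviation sqrt(2 / n_in), where n_in is the previous layer's size.
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

# Hypothetical layer: 256 neurons feeding into 128 neurons
weights = he_init(n_in=256, n_out=128)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(sample_std, math.sqrt(2.0 / 256))
```

The larger the previous layer, the narrower the range of the initial weights, which keeps the weighted sums from growing with layer width.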
A neural network is the grouping of neurons into layers and there can be many layers between the input and output layer. Most applications require networks that contain at least the three layers – input, hidden, and output. Each neuron in the hidden layer will be connected to all the neurons in the previous layer. We can start with these two types of basic perceptrons. They feed information from the front to the back and therefore are called Feed Forward networks.
Single Layer Feed Forward Neural Network consists of a single layer of weights, that is, it only has the input and output layer. A single-layer perceptron can only be used for very simple problems, such as classifying classes that can be separated by a line.
Multi Layer Feed Forward Neural Network consists of one or more hidden layers, whose computation nodes are called hidden neurons or hidden units. A Multilayer Perceptron can be used to represent convex regions thus it can separate and work in some high-dimensional space.
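XOR is the classic example of a problem that no single line can separate but that one hidden layer handles easily. The sketch below uses hand-picked weights and a step activation purely for illustration; nothing here is learned.

```python
def step(z):
    # Step activation: the neuron fires (1) when its weighted sum is positive
    return 1 if z > 0 else 0

def xor_mlp(a, b):
    # Hidden layer: one neuron computes OR, the other computes NAND
    h_or = step(a + b - 0.5)
    h_nand = step(-a - b + 1.5)
    # Output neuron: AND of the two hidden neurons gives XOR
    return step(h_or + h_nand - 1.5)

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

Each hidden neuron draws one line through the input space, and the output neuron combines the two half-planes into a region that a single line could not describe.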
Now, we know it makes sense to have multiple layers especially when dealing with images or complex data.
How do we decide on what architecture to use? How many hidden layers should be used?
There has to be a trade-off, and there is no definite answer to this question. However, I can suggest the following:
Experimentation: Find out what works best for your data given the computational constraints.
Intuition or Google: Based on experience with past models you can come up with an answer. If you have a standard DL problem such as image classification, you can Google to find out what others have used (ResNet, VGG and so on).
Search: Try random or grid search for different architectures and choose the one giving the best score.
There are several different architectures shown in the image below. To summarize what are the parameters that govern or define the architecture:
Inside the Black box: What is going on inside this Black box algorithm? Trying to build intuition and understanding of what is going on in the different layers of a neural network. Let’s continue with the learning in this next article where we take a closer look at what happens with the different neurons and respective layers of a Neural Network.
Don’t get alarmed, we are going to put what we have learnt into practice on a playground kaggle data set explaining the code along the way.
This was coded some time back and uses the library Fastai version 0.7. Since then there have been updates to the library and new releases of PyTorch, so the current code will no longer work with Fastai v1. There are still some important concepts that can be learned from this code, such as:
We have covered some basic concepts regarding what neural networks are and how they work. However, I feel it has been too much theory, and while learning any new concept it is also important to see that theory in action. Let’s start!!!
Let’s pick up a playground problem from Kaggle. Invasive species can have damaging effects on the environment, the economy, and even human health. Consider the tangles of kudzu that overwhelm trees in Georgia, while cane toads threaten habitats in over a dozen countries worldwide. This means it is very important to track and stop the spread of these invasive species. Think of how costly and difficult it will be to undertake this task at a large scale. Trained scientists would be required to visit designated areas and take note of the species inhabiting them. Using such a highly qualified workforce is expensive, time inefficient, and insufficient since humans cannot cover large areas when sampling.
Looks like a very interesting use case for Deep Learning.
What we need is a labeled dataset of images marked as invasive or safe. Our algorithm will take care of the rest. You can start a kernel (Python Jupyter notebook) using this link and follow along. A few settings to keep in mind: make sure that you have GPU and internet enabled. There are several libraries in Python for deep learning; however, we will use fastai.
The full code is available here.
Let’s start coding!!!
    # Get automatic reloading and inline plotting
    %reload_ext autoreload
    %autoreload 2
    %matplotlib inline

Just some basic commands as practice: autoreload reloads modules automatically before executing code, and %matplotlib inline is a magic command that renders plots directly inside the notebook.

    ### Import Required Libraries
    # Using Fastai Libraries
    from fastai.imports import *
    from fastai.transforms import *
    from fastai.conv_learner import *
    from fastai.model import *
    from fastai.dataset import *
    from fastai.sgdr import *
    from fastai.plots import *

    import numpy as np
    import pandas as pd
    import torch
    import os

    PATH = "../input"
    print(os.listdir(PATH))
    TMP_PATH = "/tmp/tmp"
    MODEL_PATH = "/tmp/model/"
    sz = 224
    bs = 58
    arch = resnet34
Defining some variables:
I know in this series we have not yet covered how the convolution function and in particular how CNN’s work. However, for now all we need to know is that CNN is a type of neural network popular for image classification and Resnet is a type of architecture. Resnet-34 has 34 layers!
The programming framework used behind the scenes to work with NVidia GPUs is called CUDA. Further, to improve performance, we need to check for the NVidia package called CuDNN (special accelerated functions for deep learning).

    ### Checking GPU Set up
    print(torch.cuda.is_available())
    print(torch.backends.cudnn.enabled)
Both of these should be true.
Now let’s look at what form the data is in. That is, we need to understand how the data directories are structured, what the labels are and what some sample images look like. The f'...' syntax (an f-string) is a convenient way to build a path/string from variables.

    files = os.listdir(f'{PATH}/train')[:5]  ## train contains image names
    print(files)
    img = plt.imread(f'{PATH}/train/{files[0]}')
    plt.imshow(img);
    print(img.shape)

We get the height, width and channels using img.shape. img is a 3-dimensional array holding the red, green and blue values of each pixel, so img[:4,:4] shows the values for the top-left 4x4 pixels. The image above should give us an idea of the size of the image. Now, let’s split the data into a train and validation set.

    label_csv = f'{PATH}/train_labels.csv'
    n = len(list(open(label_csv))) - 1  # header is not counted (-1)
    val_idxs = get_cv_idxs(n)           # random 20% data for validation set
    print(n)                            # Total Data size
    print(len(val_idxs))                # Validation dataset size

    label_df = pd.read_csv(label_csv)
    ### Count of both classes
    label_df.pivot_table(index="invasive", aggfunc=len).sort_values('name', ascending=False)

The label CSV contains the image name and the corresponding label (1 or 0), where 1 means it has an invasive tag.
Table 1: Target Variable Distribution
Label | Count |
---|---|
1 | 1448 |
0 | 847 |
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv', test_name='test',
                                        val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

tfms stands for transformations. tfms_from_model takes care of resizing, image cropping, initial normalization and more. A pre-defined list of augmentation functions is applied through transforms_side_on. We can also specify random zooming of images up to a specified scale by adding the max_zoom parameter.
With ImageClassifierData.from_csv we are just putting everything together (train set, validation set, the labels and batch size).

    # fnames already start with 'train/', so only PATH is prepended
    fn = f'{PATH}/' + data.trn_ds.fnames[0]
    # img = PIL.Image.open(fn)
    size_d = {k: PIL.Image.open(f'{PATH}/' + k).size for k in data.trn_ds.fnames}
    row_sz, col_sz = list(zip(*size_d.values()))
    row_sz = np.array(row_sz); col_sz = np.array(col_sz)
    plt.hist(row_sz);
A plot of the distribution of the size of the images. Ideally, we want all images to have a standard size to allow easier computation.
Our first model: To make the process quick we will first run a pre-trained model and observe the results. Further, we can tweak the model for improvements. A pre-trained model is a model created by someone else to solve a different problem; its weights were learned on their dataset and saved. We will try out their weights as is, that is, instead of coming up with our own weights specific to our dataset, we will just use their weights. This is what we call transfer learning.
Is that a good idea?
Well, usually these weights are attained by training on a very large dataset, for example ImageNet. It helps speed up your training process.
We have train set with 1836 images and test set with 1531 which is not much to attain a high accuracy model where weights are trained from scratch. Further, in the article regarding the black box we had observed how gradients and edges are found in the initial layer of a neural network. That is useful information for our use case as well.
Let us form a function to get the data and resize images if necessary.
    def get_data(sz, bs):  # sz: image size, bs: batch size
        tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
        data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv', test_name='test',
                                            val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
        # Reading the jpgs and resizing is slow for big images,
        # so resizing them all to standard size first saves time
        return data if sz > 500 else data.resize(512, TMP_PATH)

    data = get_data(sz, bs)
    learn = ConvLearner.pretrained(arch, data, precompute=True, tmp_name=TMP_PATH, models_name=MODEL_PATH)
    learn.fit(1e-2, 3)

ConvLearner.pretrained builds a learner that contains a pre-trained model. The last layer of the model needs to be replaced with a layer of the right dimensions. The pre-trained model was trained for 1000 classes, therefore the final layer predicts a vector of 1000 probabilities. However, what we need is only a two-dimensional vector. The diagram below shows in an example how this was done in one of the earliest successful CNNs. The layer “FC8” here would get replaced with a new layer with 2 outputs.
Parameters are learned by fitting a model to the data. Hyperparameters are another kind of parameter, that cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. In learn.fit we provide the learning rate and the number of epochs (times we pass over the complete dataset).
The output of learn.fit is:
Table 2: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.379021 | 0.196531 | 0.932462 |
1 | 0.285149 | 0.168239 | 0.947712 |
2 | 0.229199 | 0.14343 | 0.947712 |
Nearly 95% accuracy on our first model!!!
Let’s form some function to try and understand what the model is doing correct and wrong. we will explore:
    # this gives prediction for validation set. Predictions are in log scale
    log_preds = learn.predict()
    print(log_preds.shape)
    preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
    probs = np.exp(log_preds[:,1])        # pr(1), where Species = Invasive is class 1

    def rand_by_mask(mask):
        # sample at most 4 of the rows selected by the mask
        idxs = np.where(mask)[0]
        return np.random.choice(idxs, min(len(idxs), 4), replace=False)

    def rand_by_correct(is_correct):
        return rand_by_mask((preds == data.val_y) == is_correct)

    def plots(ims, figsize=(12,6), rows=1, titles=None):
        f = plt.figure(figsize=figsize)
        for i in range(len(ims)):
            sp = f.add_subplot(rows, len(ims)//rows, i+1)
            sp.axis('Off')
            if titles is not None: sp.set_title(titles[i], fontsize=16)
            plt.imshow(ims[i])

    def load_img_id(ds, idx):
        return np.array(PIL.Image.open(f'{PATH}/' + ds.fnames[idx]))

    def plot_val_with_title(idxs, title):
        imgs = [load_img_id(data.val_ds, x) for x in idxs]
        title_probs = [probs[x] for x in idxs]
        print(title)
        return plots(imgs, rows=1, titles=title_probs, figsize=(16,8)) if len(imgs) > 0 else print('Not Found.')

    def most_by_mask(mask, mult):
        idxs = np.where(mask)[0]
        return idxs[np.argsort(mult * probs[idxs])[:4]]

    def most_by_correct(y, is_correct):
        mult = -1 if (y==1)==is_correct else 1
        return most_by_mask(((preds == data.val_y)==is_correct) & (data.val_y == y), mult)
Let’s take a look at what we get if we were to call these functions. Keep in mind our classification threshold is 0.5.
    # 1. A few correct labels at random
    plot_val_with_title(rand_by_correct(True), "Correctly classified")

    # 2. A few incorrect labels at random
    plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

    # Most correct classifications: Class 0
    plot_val_with_title(most_by_correct(0, True), "Most correct classifications: Class 0")

    # Most correct classifications: Class 1
    plot_val_with_title(most_by_correct(1, True), "Most correct classifications: Class 1")

    # Most incorrect classifications: Actual Class 0 Predicted Class 1
    plot_val_with_title(most_by_correct(0, False), "Most incorrect classifications: Actual Class 0 Predicted Class 1")

    # Most incorrect classifications: Actual Class 1 Predicted Class 0
    plot_val_with_title(most_by_correct(1, False), "Most incorrect classifications: Actual Class 1 Predicted Class 0")

    # Most uncertain predictions
    most_uncertain = np.argsort(np.abs(probs - 0.5))[:4]
    plot_val_with_title(most_uncertain, "Most uncertain predictions")
Scope of Improvement:
    ## How does loss change with changes in Learning Rate (For the Last Layer)
    learn.lr_find()
    learn.sched.plot_lr()

The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing.

    # Note that the loss is still clearly improving till lr=1e-2 (0.01).
    # The LR can vary as a part of the stochastic gradient descent over time.
    learn.sched.plot()

We can see the plot of loss versus learning rate to see where our loss stops decreasing:
Now that we have an idea of how to select our learning rate, we need to set the number of epochs. For that, we just need to ensure that there is no over-fitting. Let’s talk about data augmentation.
Data augmentation is a good step to prevent over-fitting. That is, by cropping/zooming/rotating the image, we can ensure that the model does not learn patterns specific to the train data and generalizes well to new data.
    def get_augs():
        tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
        data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv', bs=2, tfms=tfms,
                                            suffix='.jpg', val_idxs=val_idxs, test_name='test')
        x, _ = next(iter(data.aug_dl))
        return data.trn_ds.denorm(x)[1]

    # An Example of data augmentation
    ims = np.stack([get_augs() for i in range(6)])
    plots(ims, rows=2)

With precompute=True, all layers of the neural network except the last are frozen, so we are only updating the weights of the last layer with our dataset. Now we will train the model with precompute set to False and cycle_len enabled. cycle_len enables a technique called stochastic gradient descent with restarts (SGDR), a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. In other words, SGDR reduces the learning rate every mini-batch and resets it every cycle_len epochs. This is helpful because as we get closer to the optimal weights, we want to take smaller steps.

    learn.precompute = False
    learn.fit(1e-2, 3, cycle_len=1)
Table 3: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.221001 | 0.1623 | 0.943355 |
1 | 0.232999 | 0.179043 | 0.941176 |
2 | 0.224435 | 0.148815 | 0.947712 |
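The shape of an SGDR learning rate schedule can be sketched independently of any library. The function below is an illustration of cosine annealing with restarts, not fastai's internal implementation; `sgdr_schedule`, `n_cycles` and `batches_per_epoch=32` are hypothetical names and values.

```python
import math

def sgdr_schedule(base_lr, n_cycles, cycle_len, batches_per_epoch, cycle_mult=1):
    # Cosine-annealed learning rate per mini-batch, restarting at base_lr
    # at the start of each cycle; each cycle is cycle_mult times longer
    # than the one before it.
    lrs = []
    epochs_in_cycle = cycle_len
    for _ in range(n_cycles):
        n_iter = epochs_in_cycle * batches_per_epoch
        for it in range(n_iter):
            lrs.append(0.5 * base_lr * (1 + math.cos(math.pi * it / n_iter)))
        epochs_in_cycle *= cycle_mult
    return lrs

# batches_per_epoch=32 is an illustrative value, not taken from the article
lrs = sgdr_schedule(1e-2, n_cycles=3, cycle_len=1, batches_per_epoch=32)
lrs_mult = sgdr_schedule(1e-2, n_cycles=3, cycle_len=1, batches_per_epoch=32, cycle_mult=2)

print(len(lrs) // 32, len(lrs_mult) // 32)  # total epochs in each schedule
```

With three cycles of one epoch each the schedule covers 3 epochs, and doubling the cycle length each time gives 1 + 2 + 4 = 7 epochs.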
Calling learn.sched.plot_lr() once again:
To unfreeze the remaining layers, we call unfreeze. We will also try differential learning rates for the respective layer groups.

    learn.unfreeze()
    lr = np.array([1e-4, 1e-3, 1e-2])
    learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
Table 4: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.323539 | 0.178492 | 0.923747 |
1 | 0.247502 | 0.132352 | 0.949891 |
2 | 0.192528 | 0.128903 | 0.954248 |
3 | 0.165231 | 0.101978 | 0.962963 |
4 | 0.141049 | 0.106319 | 0.960784 |
5 | 0.121947 | 0.103018 | 0.960784 |
6 | 0.107445 | 0.100944 | 0.965142 |
Improved our model, 96.5% accuracy…
Above, we have the learning rate of the final layers. The learning rates of the earlier layers are fixed at the same multiples of the final layer rates as we initially requested (i.e. the first layers have 100x smaller, and middle layers 10x smaller learning rates, since we set lr=np.array([1e-4,1e-3,1e-2])).
To get a better picture, we can use test time augmentation (learn.TTA()), that is, we use data augmentation techniques on our validation set. By making predictions on both the validation set images and their augmented versions and averaging them, we usually get slightly more accurate predictions.
Our confusion matrix:
Our final accuracy was 96.73% and upon submission to the public leader-board we got 98%.
Data Exploration:
Models Tweaking:
Let’s try to take a sneak peek inside the black box of deep learning and try to build some intuition along the way.
We have covered the origins and understood a little bit about the structure of neural networks in the previous articles. However, before we further dive into the math behind the working of neural networks, we need to polish our understanding of what is going on inside the black box.
Deep learning algorithms are mostly a black box. We do not know what patterns are being observed that trigger an activation function. We can make a guess: for example, when a network classifies an image as “Dog” in the Cats vs Dogs dataset, it probably saw the ears or the shape of the dog’s face. But this uncertainty would not work when these algorithms are being used in self-driving cars. In such use cases we need to know why the algorithm is behaving the way it is.
In a neural network, a given neuron does not necessarily fire for every image. That is, a neuron is activated only by select features present in the input images.
Well, some light was shed on feature visualization by Matthew D. Zeiler and Rob Fergus in a research paper that is available here.
They put together a novel method to decode these features. First, they trained a normal CNN (convolutional neural network, a type of neural network) to classify images. They then attached a backward-looking network (a deconvnet) to its layers.
To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
Suppose we pass image A into our CNN, it passes through layer K, and only neuron M ends up being activated. The backward-looking network (deconvnet) is then used to reconstruct the state in the previous step. That is, based on the output of layer K, we set all activations except M to 0 and then try to reverse whatever activity in the previous layer gave rise to neuron M’s activation.
Thus, our goal here is to understand what feature activates a neuron. Say we are training our network on the cats vs dogs dataset. We focus on a single neuron and ignore all the others in order to understand what activated it. Maybe the dog’s ears are the defining feature for this neuron, and that is what it looks for in every image. Using this methodology we can explore how things work layer by layer. In the images below you will notice the actual images and what the neural network is observing:
In layer 1 the CNN identifies color gradients, and as we look at deeper layers it picks up more complex patterns. The patterns progress from gradients to edges and shapes to complex features like eyes.
Several images had to be supplied to the network to see what activates the selected neuron, which becomes computationally intensive. There is another way: what if we supply an image made of random pixels and try to find out what would excite one particular neuron? We take an image similar to the one shared below and run it through the neural network while looking at only that one neuron. We then work out how to change the color of each pixel to increase the activation of that neuron. More information is available here in the paper by Jason Yosinski.
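The random-image idea can be sketched on a toy scale. The snippet below is only an illustration under strong assumptions: the "network" is a single sigmoid neuron with made-up fixed weights w, and the image is a flattened 8x8 array. The paper works with full deep networks, so treat this as a sketch of gradient ascent on the pixels, not a reproduction of that method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 64  # a flattened 8x8 toy "image"

# Hypothetical fixed weights of the one neuron we want to excite.
w = rng.normal(size=n_pixels) / np.sqrt(n_pixels)

def activation(img):
    """Sigmoid activation of the toy neuron for a given image."""
    return 1.0 / (1.0 + np.exp(-(w @ img)))

# Start from random pixels in [0, 1].
img = rng.uniform(0.0, 1.0, size=n_pixels)
start = activation(img)

step_size = 1.0
for _ in range(200):
    a = activation(img)
    grad = a * (1.0 - a) * w                         # d(activation) / d(pixels)
    img = np.clip(img + step_size * grad, 0.0, 1.0)  # keep valid pixel values

end = activation(img)
print("activation before:", start, "after:", end)
```

The weights stay frozen and only the pixels change, which is the reverse of ordinary training.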
So what do these activated images look like?
Hope this gives you a sneak peek into how neural networks work, especially with image data. If you wish to explore this further, please have a look at this amazing blog post on distill.pub.
Let’s say you are working on predicting the future sales of a retail store. A neuron might get activated by, or give higher weights to, certain inputs, for example the variables item category and season of sale. For simplicity, try to think of these as the weights in a linear regression. Why would these two variables cause the needle to shift?
Well, maybe there are some seasonal products in the data set which activate that particular neuron. Similarly we can develop some intuition of which variables are influencing our neurons.
Let’s dive into the core of neural networks. Understand the concepts of gradient descent and back-propagation to get some idea of how neural networks work. Warning: some math involved! Don’t worry, we will first try to explain it in an intuitive manner and then explore some of the math behind it.
Understand how an activation function behaves for different neurons and connect it to the grand architecture. The different types of activation functions are explored in detail.
We briefly discussed activation functions in the blog on the architecture of neural networks. But let’s improve our understanding by diving into this topic further.
Activation functions are essentially the deciding authority on whether the information provided by a neuron is relevant or can be ignored. Drawing a parallel to our brain, there are many neurons, but not all of them are activated by an input stimulus. Thus, there must be some mechanism that decides which neuron is triggered by a particular stimulus. Let’s put this in perspective:
The output signal is produced only if the neuron is activated. Consider a neuron A that provides the weighted sum of its inputs along with a bias term: Y = sum(weight * input) + bias.
Thus, we are simply doing some linear matrix transformations, and as mentioned in the deep learning architecture blog, a linear operation alone is not strong enough. We need to add some non-linear transformations, and that is where activation functions come into the picture.
Also, the range of this weighted sum is -inf to inf. When we get an output from the neural network, this range does not make sense. For example, if we are classifying images as Dogs or Cats, what we need is a binary value or a probability, so we need a mapping to a smaller space. The output space could be 0 to 1, -1 to 1, and so on, depending on the choice of function.
So, to summarize, we need activation functions to introduce non-linearities, get better predictions, and reduce the output space.
Now, let’s do a simple exercise: given this idea of neurons being activated, how would you come up with an activation function? What we want is a binary value indicating whether a neuron is activated or not.
The first thing that comes to mind is defining a threshold. If the value is beyond a certain threshold, declare the neuron activated. If we define this function on the space 0 to 1, we can simply say that for any value above 0.5 the neuron is considered activated.
Wow! We have our first activation function. What we have defined here is a Step Function also known as Unit or Binary Step function.
Advantage
Disadvantage
Thus, we want the activation function to be differentiable because of how back-propagation works.
Now let’s take a look at the larger picture. There are multiple neurons in our neural network. In the blog on the intuitive understanding of neural networks, we discussed how neural networks look for patterns in images. If you haven’t read it, all you need to know is that different neurons might select or identify different patterns. Revisiting the Dog vs Cat image identification example, what will happen if multiple neurons are activated?
With the use case defined above, let’s try a linear function, as we have figured out that a binary function didn’t help much: f(X) = CX, a straight-line function where the activation is proportional to the weighted sum from the neuron. If more than one neuron gets activated, we can take the max of the neuron activation values, so that we have only one neuron to be concerned about.
Oh wait! The derivative of this function is a constant value: f’(X) = C. What does that mean?
Well, this means that every time we do back-propagation, the gradient of the activation is the same constant and does not depend on the input. Also, with each layer being a linear transformation, the final output is just a linear transformation of the input, however many layers we stack. Further, an output space of (-inf, inf) is difficult to work with. Hence, not desirable.
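A minimal sketch of the step function described above, assuming the 0.5 threshold from the text. The finite-difference check is an addition for illustration: away from the threshold the slope is zero, so the function gives gradient-based learning nothing to work with.

```python
import numpy as np

def binary_step(z, threshold=0.5):
    """Return 1.0 where z is above the threshold, otherwise 0.0."""
    return (np.asarray(z, dtype=float) > threshold).astype(float)

z = np.array([0.1, 0.49, 0.51, 0.9])
step_out = binary_step(z)

# Numerical slope at each point (none of them sit exactly on the threshold).
eps = 1e-3
slope = (binary_step(z + eps) - binary_step(z - eps)) / (2 * eps)

print("outputs:", step_out)
print("slopes: ", slope)
```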
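The claim that stacked linear layers collapse into a single linear transformation can be checked numerically. In this sketch the weight matrices W1 and W2 are random and purely illustrative, and the linear activation uses C = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear activation f(X) = X (that is, C = 1).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Passing the input through both layers in turn.
two_layer = W2 @ (W1 @ x)

# A single layer whose weights are the product of the two.
W_combined = W2 @ W1
one_layer = W_combined @ x

print(two_layer)
print(one_layer)
```

Both computations give the same result, so the extra layer adds no expressive power without a non-linearity in between.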
Let’s pull out the big guns: the sigmoid function, f(Z) = 1 / (1 + Exp(-Z)), a smoother version of the step function. It is non-linear and can be used for non-binary activations. It is also continuously differentiable.
The curve is steepest for values of Z between -2 and 2, so in that region even small changes in Z result in large changes in the value of the function. This pushes values towards the extreme parts of the curve, making clear distinctions in prediction. Another advantage of the sigmoid activation is that its output lies between 0 and 1, making it an ideal function for use cases where a probability is required.
That all sounds good! Then what’s the issue? Beyond +3 and -3 the curve gets pretty flat. This means that the gradient at such points is very small. Thus, the improvement in error becomes almost zero at these points and the network learns slowly. This is known as the vanishing gradient problem. There are some ways to take care of this issue. Other issues are the computational load and the output not being zero-centered.
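To see the flattening in numbers, here is a small sketch of the sigmoid and its derivative, using the standard identity that the derivative equals sigmoid(z) * (1 - sigmoid(z)). The sample points are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
values = sigmoid(z)
grads = sigmoid_grad(z)

for zi, vi, gi in zip(z, values, grads):
    print(f"z={zi:6.1f}  sigmoid={vi:.5f}  gradient={gi:.6f}")
```

The gradient peaks at z = 0 and shrinks quickly as z moves away from it, which is the vanishing gradient behaviour described above.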
Hyperbolic tangent activation function is very similar to the sigmoid function.
Compare the formula of the tanh function with sigmoid: tanh(x) = 2 sigmoid(2x) - 1
To put it in words, if we scale and shift the sigmoid function we get the tanh function. Thus, it has similar properties to the sigmoid function. The tanh function also suffers from the vanishing gradient problem and therefore kills gradients when saturated. Unlike sigmoid, tanh outputs are zero-centered since the range is between -1 and 1.
To truly address the problem of vanishing gradients we need to talk about the rectified linear unit (ReLU) and Leaky ReLU. ReLU is one of the most popular functions used as a hidden-layer activation function in deep neural networks.
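The relationship between tanh and sigmoid is easy to verify numerically. The sketch below compares NumPy’s np.tanh against the scaled sigmoid on a grid of points and also computes the tanh gradient, 1 - tanh(x)^2, to show that it saturates in the same way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5.0, 5.0, 101)

lhs = np.tanh(x)                    # tanh from NumPy
rhs = 2.0 * sigmoid(2.0 * x) - 1.0  # scaled and shifted sigmoid

# Gradient of tanh: largest at 0, close to 0 in the saturated regions.
tanh_grad = 1.0 - np.tanh(x) ** 2

print("max difference:", np.max(np.abs(lhs - rhs)))
print("gradient at the centre:", tanh_grad[50], "at the edge:", tanh_grad[-1])
```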
g(z) = max(0, z)
When the input z < 0 the output is 0, and when z > 0 the output is z. As you can see, a derivative exists for all values except 0: the left derivative is 0 while the right derivative is 1. That’s a new issue: how will it work with gradient descent? In practice, an input that evaluates to 0 is most likely a value close to zero that was rounded to zero, so the problem rarely matters. Software implementations of neural network training usually return one of the one-sided derivatives instead of raising an error. ReLU is computationally very efficient, but it is not a zero-centered function. Another issue is that if z < 0 during the forward pass, the neuron remains inactive and kills the gradient during the backward pass. The weights then do not get updated and that neuron stops learning (the dying ReLU problem).
Leaky ReLU is a modification of the ReLU activation function. The idea is that when x < 0 the function has a small positive slope of 0.1 instead of being flat. This eliminates the dying ReLU problem, but the results achieved with it are not consistent. It otherwise keeps the characteristics of a ReLU activation function: it is computationally efficient, converges much faster, and does not saturate in the positive region.
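A minimal sketch of ReLU and the gradient used in the backward pass. Returning 0 at exactly 0 (the left derivative) is a choice made here for illustration; an implementation could equally return the right derivative.

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, np.asarray(z, dtype=float))

def relu_grad(z):
    """Gradient of ReLU; at z == 0 this sketch returns the left derivative, 0."""
    return (np.asarray(z, dtype=float) > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu_out = relu(z)
relu_g = relu_grad(z)

print("relu:    ", relu_out)
print("gradient:", relu_g)
```

For every negative input the gradient is exactly zero, so nothing flows back through that neuron, which is the inactive-neuron issue described above.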
f(x) = max(0.1*x,x)
There are many different generalizations and variations to ReLU such as parameterized ReLU.
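Here is a small sketch of leaky ReLU with the slope exposed as a parameter. The name alpha is an assumption for illustration, with a default of 0.1 to match the formula above. In a parameterized ReLU this slope would be learned during training rather than fixed, which this sketch does not do.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: max(alpha * x, x), valid for 0 < alpha < 1."""
    x = np.asarray(x, dtype=float)
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    """Gradient: 1 for positive inputs, alpha otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
leaky_out = leaky_relu(x)
leaky_g = leaky_relu_grad(x)

print("leaky relu:", leaky_out)
print("gradient:  ", leaky_g)
```

Because the gradient for negative inputs is alpha rather than zero, a neuron that lands in the negative region can still receive updates.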
To tie things together let’s discuss one last function. It is often used in the final layer of neural networks. The softmax function generalizes the sigmoid function to more than two classes and is often used for multi-class classification problems.
The softmax output for class j is Exp(Zj) divided by the sum of Exp(Zk) over all classes k. Look at the numerator: as we are taking the exponential of Zj, it is always a positive value. Further, even small changes in Zj result in widely varying values (exponential scale). The denominator is the sum of Exp(Zk) over all classes, so the resulting probabilities add up to 1.
This makes it perfect for classifying inputs into one of multiple classes. For example, given satellite images of a landscape, we might want to assign each image one label such as water, rain forest, or land.
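A short sketch of softmax in NumPy. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it is not part of the description above and does not change the result. The three scores are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array of scores."""
    z = np.asarray(z, dtype=float)
    exps = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exps / np.sum(exps)

# Hypothetical scores for three classes, e.g. water, rain forest, land.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print("probabilities:", probs)
print("sum:", probs.sum())
print("predicted class index:", int(np.argmax(probs)))
```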
Many activation functions and their characteristics have been discussed here. But the final question is when to use what?
There is no exact rule for the choice; rather, it is determined by the nature of your problem. Keep the characteristics of the activation functions in mind and choose the one that suits your use case and provides faster convergence. The most common practice is to use ReLU for the hidden layers and sigmoid (binary classification, for example Cat vs Dog) or softmax (multi-class classification) in the final layer.
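As a rough illustration of that common recipe, the sketch below runs a single forward pass through one ReLU hidden layer and then applies either a sigmoid output (binary case) or a softmax output (multi-class case). All weights, layer sizes, and the input are random and hypothetical; nothing here is trained.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # hypothetical input features

# Hidden layer with ReLU.
W_hidden = rng.normal(size=(5, 4))
b_hidden = np.zeros(5)
hidden = relu(W_hidden @ x + b_hidden)

# Binary output (e.g. Cat vs Dog): one sigmoid unit.
w_binary = rng.normal(size=5)
p_binary = sigmoid(w_binary @ hidden)

# Multi-class output: one softmax over three classes.
W_multi = rng.normal(size=(3, 5))
class_probs = softmax(W_multi @ hidden)

print("binary probability:", p_binary)
print("class probabilities:", class_probs)
```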