The post NLP with Deep Learning appeared first on Data Science Discovery.
Our focus here is to solve a text classification problem using deep learning. To restate the NLP (Natural Language Processing) text classification problem:
In today’s world, websites have to deal with toxic and divisive content, especially major websites like Quora, which cater to large volumes of traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or questions that intend to make a statement rather than look for helpful answers.
A question is classified as insincere if:
For more information regarding the challenge you can use the following link.
The full deep learning code used is available here.
In part 1, we saw how machine learning algorithms could be applied to text classification. We had to identify and create a variety of features to reduce the complexity in the data. This required considerable effort but was essential for the learning algorithms to be able to detect patterns in the data.
In this section, we will approach the same problem using deep learning techniques. Deep learning has become the state-of-the-art method for a variety of classification tasks. But the first thing we need to understand is the motivation for using it, which is outlined here:
That all sounds great, but what is the difficult part of this task? The major problem is defining the right architecture. To get started, our primary task should be to understand the type of neural network we wish to use, and then the architecture of that network. In other words, we need to make a choice based on the following parameters:
Let’s understand some key concepts before we proceed further:
Words represented as real valued vectors are what we call word embeddings. The value associated with each word is either learned by using a neural network on a large dataset with a predefined task (like document classification) or by using an unsupervised process such as using document statistics. These weights can be used as a part of transfer learning to take advantage of the reduced training time and better performance.
Consider how we make sense of a sentence, we not only look at a word but also how it fits with the preceding words. Recurrent neural networks (RNN) take into consideration both the current inputs as well as the preceding inputs. They are suitable for sequence problems because their internal memory stores important details about the inputs that they received in the past which helps them precisely predict the output of the next time step.
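To make the idea of "current input plus memory of past inputs" concrete, here is a minimal sketch of a single recurrent step in NumPy. The sizes and the random weights are made up purely for illustration; in a real network they would be learned during training.

```python
import numpy as np

# A single recurrent cell: the new hidden state mixes the current input
# with the previous hidden state, which acts as the network's memory.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_x = rng.normal(size=(hidden_dim, input_dim))   # input  -> hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # tanh squashes the combined signal into (-1, 1)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sequence of 5 time steps, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h)

print(h.shape)  # (4,)
```

The final hidden state `h` summarizes the whole sequence, which is why the prediction for the next time step can depend on words seen much earlier.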
GRU and LSTM are improved variations of the vanilla RNN that tackle the vanishing gradient problem and handle long-term dependencies better.
Consider a simple use case in which we are trying to infer the weather from a conversation between two people. Take the following text as an example:
“We were walking on the road when it started to pour, luckily my friend was carrying an umbrella.”
The target variable here has the classification “rain”. In such a case, the RNN has to keep “pour” in memory while considering “umbrella” to correctly predict rain, as the occurrence of “umbrella” alone is not definitive proof of rain. Similarly, the occurrence of “didn’t” before “pour” would have changed everything.
A bidirectional RNN first forward propagates left to right, starting with the initial time step. Then starting at the final time step, it moves right to left until it reaches the initial time step. This learning of representations by incorporating future time steps helps understand context better.
In other words, let’s go back to our example of deciphering the weather from a conversation. Reading left to right, it makes sense to make the leap from “pour” to “umbrella”. But going right to left adds to the power of the model, because another conversation may have a different ordering of words, for example:
“I took out an umbrella as it started to pour.”
I think the best way to describe attention is by having a look at a basic CNN use case of image classification (dogs vs cats). If you are given an image of a dog, what is the most defining characteristic that helps you differentiate it? Is it the dog’s nose or its ears? The attention mechanism blurs certain pixels and focuses on only a portion of the image. Thus, it assigns a weight and tells the model what to focus on.
In our context, the attention model takes into account input from several time steps back and assigns a weight, signifying interest, to each of them. Attention is used to focus on specific words in the text sequence; for example, over a large dataset the attention layer will give more weight to words like “rain”, “pour”, and “umbrella”.
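A toy sketch of that weighting, with invented hidden states and scores: a learned attention layer would compute the scores from the hidden states via trained weights, but the softmax-and-weighted-sum mechanics are the same.

```python
import numpy as np

# Hidden states for a 4-step sequence (values are made up).
hidden_states = np.array([[0.1, 0.2],    # "we"
                          [0.0, 0.1],    # "were"
                          [0.9, 0.8],    # "pour"
                          [0.8, 0.9]])   # "umbrella"
scores = np.array([0.1, 0.0, 2.0, 1.8])  # higher score = more relevant

# Softmax turns raw scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# The attended representation is a weighted sum of the hidden states,
# dominated here by the informative words "pour" and "umbrella".
context = weights @ hidden_states
print(weights.round(2))
```

The `context` vector is what gets passed on to the rest of the network, so the uninformative words contribute very little to it.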
We used Stochastic Weight Averaging (SWA) to update the weights of our network for the following reasons:
The image below (illustrations of SWA and SGD with a Pre-activation ResNet-164 on CIFAR-100) shows how well SWA generalizes and results in better performance. On the left, W1, W2, and W3 are the weights of three independently trained networks and Wswa is their average. This holds even though, on the right, we see a greater train loss for SWA than for SGD.
The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short-term negative effect and yet achieve a longer-term beneficial effect. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries and the learning rate cyclically varies between these bounds based on a predefined function. It is possible that during gradient descent we get stuck at a local minimum, but a cyclical learning rate can help jump out to a different location and move towards the global optimum.
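A minimal sketch of the triangular variant of a cyclical learning rate: the rate rises linearly from a lower bound to an upper bound and back again, over and over. The bounds and step size below are arbitrary illustrative values, not the ones used in our model.

```python
import math

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=100):
    """Cyclical learning rate: rises linearly from base_lr to max_lr over
    step_size iterations, then falls back to base_lr, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

lrs = [triangular_clr(i) for i in range(400)]
print(min(lrs), max(lrs))  # stays within [base_lr, max_lr]
```

At iteration 100 the rate peaks at `max_lr`, drops back to `base_lr` by iteration 200, and the cycle repeats; the occasional large steps are what can knock the optimizer out of a poor local minimum.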
This is a technique specifically used to prevent overfitting. It involves dropping some percentage of the units during the training process. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. The units are selected randomly. Dropout randomly zeros out activations in order to force the network to generalize better, overfit less, and build in redundancies as a regularization technique.
There are different ways to drop values. Think of pixels in an image: these pixels are highly correlated with their neighbours, so randomly dropping individual pixels will not accomplish anything. That is where techniques like spatial dropout come into the picture. Spatial dropout involves dropping entire feature maps. For explanation purposes, consider an image cropped into smaller segments, each mapped using a function; out of all these mapped values we randomly delete some in their entirety.
If we take textual data you can think of it as dropping entire phrases and forcing the model to generalize better.
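Here is a NumPy sketch of the idea for text, assuming made-up shapes and dropout rate: instead of zeroing individual values, we drop entire embedding channels across all time steps, so correlated values within a channel disappear together (this is what Keras's `SpatialDropout1D` does during training).

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, emb_dim, rate = 6, 8, 0.25

x = rng.normal(size=(seq_len, emb_dim))  # one embedded sentence
keep = rng.random(emb_dim) >= rate       # one mask entry per channel
x_dropped = x * keep / (1 - rate)        # zero dropped channels, rescale kept ones

# A dropped channel is zero at every time step of the sequence.
dropped_channels = np.where(~keep)[0]
print(dropped_channels, np.allclose(x_dropped[:, dropped_channels], 0))
```

Dividing by `1 - rate` (inverted dropout) keeps the expected magnitude of the activations unchanged, so nothing needs rescaling at inference time.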
Rather than having varying ranges in our data, we often normalize the dataset to allow faster convergence. The same principle is used in neural networks for the inputs of the hidden layers. Batch normalization reduces internal covariate shift, which helps the network generalize better. That means even if the values in the train set and test set are vastly different, normalizing them reduces overfitting and helps get better results.
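A minimal sketch of the normalization step itself, on made-up activations: each feature is normalized across the batch to zero mean and unit variance, then rescaled by learnable parameters gamma and beta (initialized here to 1 and 0, their usual starting values).

```python
import numpy as np

rng = np.random.default_rng(7)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # made-up layer inputs

mean = batch.mean(axis=0)
var = batch.var(axis=0)
eps = 1e-5                       # avoids division by zero
gamma, beta = np.ones(4), np.zeros(4)

# Normalize each feature across the batch, then rescale and shift.
normalized = gamma * (batch - mean) / np.sqrt(var + eps) + beta
print(normalized.mean(axis=0).round(6))  # ~0 for every feature
```

Whatever scale the raw activations arrive at, the layer after batch normalization always sees inputs in a stable range.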
Global average and global max pooling reduce the spatial size of the feature map/ representation to one feature map for each category (classification task).
Let’s say that we have an image of a dog. Global average pooling will take the average of all the activation values and tell us about the overall strength of the image, i.e. whether the image is of a dog or not.
Global max pooling, on the other hand, will take the maximum of the activation values. This will help identify the strongest trait of the image, say, ears of the dog. Similarly, in the case of textual data, global max pooling highlights the phrase with the most information while global average pooling indicates the overall value of the sentence.
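The two pooling operations above amount to a max or a mean over the time axis of the hidden states. A tiny NumPy sketch with invented values:

```python
import numpy as np

# Hidden states for a 3-step sequence (rows: time steps, cols: features).
hidden_states = np.array([[0.1, 0.5],
                          [0.9, 0.2],   # strongest first feature here
                          [0.3, 0.8]])  # strongest second feature here

global_max = hidden_states.max(axis=0)   # strongest trait anywhere in the sequence
global_avg = hidden_states.mean(axis=0)  # overall strength of the sequence

print(global_max)  # [0.9 0.8]
```

Note that max pooling can pick each feature's peak from a different time step, which is why it behaves like "highlight the most informative phrase" for text.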
Let’s dive deeper into the choices made and connect the dots between understanding the components of a neural network to actually forming one. Well, one of the most important factors that comes into play while deciding on the architecture is past experience and experimentation.
We first started with a very basic model (GRU) and plotted the accuracy and loss. We noticed during our experimentation that given the nature of our dataset it was very easy to overfit. Observe the graph below obtained on using a bidirectional GRU with 64 units/neurons and a single hidden layer (16 units):
Clearly, we start to overfit very quickly, so we cannot keep the number of epochs high. This also suggests that a very complex model (one with multiple layers and many nodes) will be prone to overfitting, and underlines the importance of regularization techniques like dropout in the architecture.
We also carried out error analysis to further get an idea of the performance of the baseline model.
We intended to understand the following:
What are insincere topics that the network strongly believes to be sincere?
class 0 [0.94] Should we kick out Muslims from India?
class 0 [0.94] Why does Quora keep posting this leftist propaganda? Are they owned by a liberal media conglomerate?
class 0 [0.94] Why don’t the Christian and Muslim quarter be under Palestinian control and the Jewish quarter be under Israeli control?
Our baseline model has shown an F1 score of 0.63, yet these sentences are being tagged as sincere. What is the issue?
After a deep dive into the dataset, we noticed that it contains several mislabelled cases. The competition details specify that the method used for labelling was a combination of manual effort and some ML/rule-based efforts. This also brings our attention to a real-world scenario that can exist in any ML/DL project. Ideally, we would like to relabel such wrongly tagged cases, as the model is currently learning from mislabelled examples as well. However, that cannot be done given the constraints of the competition.
What are sincere topics that the network strongly believes to be insincere?
class 1 [0.00000001] What astrological combinations are needed to obtain a scholarship for studies?
class 1 [0.00000001] How many karats needed for a karat farm?
class 1 [0.0000001] How many SITA devices are present in Ramayana?
What are the insincere/sincere topics that the network is most uncertain how to classify?
Keep in mind that our chosen threshold for converting probabilities into a binary label is 0.4.
class 0 [0.40] Why do most people not consider the US a two party system, even though any party besides Democrat/Republican realistically has no chance for a win?
class 0 [0.40] Feigning ignorance of who wrote it, could you have mistaken "Something" for Paul McCartney number if he had sung it instead of George?
class 1 [0.399] Why can't we make radiation death rays with huge spread and nearly infinite range to nullify Western air power once and for all and put a stop their invasions of our sovereignty?
class 1 [0.399] What country is more tolerant to gays, Saudi Arabia or USA?
What are insincere topics that the network classifies correctly with high confidence?
class 1 [0.99] Why are the Quora moderators f*cking ***holes?
class 1 [0.98] Why is it that Quora is so f*cking liberal and stupid?
This gives us a general idea of what is happening and helps us develop some intuition of where the model shines and where it goes wrong. In particular, we also get to see some words that might be clearly tipping the classification scales in a particular direction.
Let’s start with the input data: we have used an average of the pre-trained GloVe and Paragram word embeddings. The reason for this choice was that GloVe and Paragram were found to cover above 80% of our corpus. Further, taking an average of these embeddings gave better results on our baseline model.
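To show what averaging two pre-trained embeddings looks like, here is a sketch with a toy two-word vocabulary and made-up two-dimensional vectors (the real GloVe and Paragram files map each word to a 300-dimensional vector):

```python
import numpy as np

# Toy "pre-trained" embeddings for a tiny vocabulary.
glove = {"rain": np.array([0.1, 0.3]), "umbrella": np.array([0.2, 0.0])}
paragram = {"rain": np.array([0.3, 0.1]), "umbrella": np.array([0.0, 0.4])}

word_index = {"rain": 0, "umbrella": 1}  # word -> row in the embedding matrix
embedding_dim = 2

# Build the matrix fed to the network's embedding layer: each row is the
# element-wise average of the two pre-trained vectors for that word.
embedding_matrix = np.zeros((len(word_index), embedding_dim))
for word, idx in word_index.items():
    embedding_matrix[idx] = (glove[word] + paragram[word]) / 2.0

print(embedding_matrix[0])  # averaged vector for "rain" -> [0.2 0.2]
```

In practice one also needs a fallback (e.g. zeros or random vectors) for corpus words missing from both embedding files; that bookkeeping is omitted here.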
Spatial dropout is used immediately after the embeddings. This makes sure our model is more robust during training and prevents overfitting.
Next, we used a bidirectional LSTM layer with 40 units. We decided to split the model into two pathways:
The best way to look at this is that we have made two branches: the former (LSTM followed by attention) maintains simplicity, while the latter allows the model to learn more through an additional layer. The choice of 40 units is based mostly on experimentation and intuition; we noticed that with a larger number of units the model started to overfit almost immediately.
In the latter branch, the output of the second bidirectional LSTM layer was used for three operations: an attention layer, global average pooling, and global max pooling. Each of these layers brings forth diverse features from the data and outputs 80 units.
All these outputs (from both branches) are concatenated (the concatenation of the 4 outputs gives us 320 units) and fed into a layer of 32 units with a ReLU activation. This is followed by batch normalization and dropout to speed up computation and reduce overfitting. After this, we have the final output layer with a sigmoid activation function.
Kaggle had set the evaluation metric to be the F1 score. This was a suitable choice, instead of accuracy, because of the class imbalance present in the dataset. Moreover, due to some of the questions being labelled incorrectly, techniques used to handle class imbalance, such as undersampling and oversampling, might actually increase the incorrectly labelled questions or decrease the correctly labelled ones. Further, computational constraints were another important factor to keep in mind while making any decision.
Our approach for model validation included creating a train and a validation dataset. We ran 10 epochs, which is the maximum we could run with this model, as beyond this we would start overfitting or violate computational constraints. We started SWA from the 4th epoch, as at that point the F1 score had already reached close to 0.65, making it a good point to start the process.
Once we had the final predictions from the model, we binarized the probabilities using a threshold obtained from the validation dataset.
Kaggle’s score calculation used only 15% of the data for the public leaderboard and the remainder for the private leaderboard. Our final model returned a score of 0.68 on the public leaderboard and around 0.68875 on the private leaderboard. This stability across the two was a good indication of a well-generalized model.
Here is a look at the confusion matrix:
The traditional ML approach yielded a score of 0.583 as compared to the deep learning model’s score of 0.68.
While the deep learning model clearly outperformed the traditional ML stacking, there are a few points to consider before you set the course of your text classification problem:
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
The post NLP with ML appeared first on Data Science Discovery.
Purpose:
Natural language processing (NLP) has become widely popular. With the large amount of data available (in emails, web pages, SMS), it is important to extract valuable information from textual data, and an assortment of machine learning techniques has been designed to accomplish this task. With current advances in deep learning, we felt it would be interesting to compare traditional and deep learning techniques. We picked a playground Kaggle dataset for text classification and implemented both types of algorithms for comparison purposes.
In today’s world, websites have to deal with toxic and divisive content, especially major websites like Quora, which cater to large volumes of traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or questions that intend to make a statement rather than look for helpful answers.
A question is classified as insincere if:
For more information regarding the challenge you can use the following link.
The full code is available here.
In this article we will tackle text classification by using machine learning and NLP techniques. For any data science problem with textual data the common steps include:
Let’s explore them step by step in more detail.
This is one of the most important steps of any project: you need to familiarize yourself with the data prior to implementing any modeling technique.
import os
print(os.listdir("../input"))
Our dataset includes:
We are not allowed to use any external data sources. The following embeddings are given to us which can be used for building our models.
A look at the size of our train and test data:
In the target variable 1 represents the class Insincere and 0 the Sincere class of questions.
Let’s explore the distribution of the target variable:
import numpy as np
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

## target distribution ##
cnt_srs = train_df['target'].value_counts()
labels = np.array(cnt_srs.index)
sizes = np.array((cnt_srs / cnt_srs.sum()) * 100)
trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Target distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename="usertype")
These box plots shared below will help understand if there are any patterns in the dataset regarding the word count or the number of characters.
The box plots show that:
- insincere questions have more words per question;
- insincere questions have more characters than sincere questions;
- sincere questions have less punctuation;
- sincere questions have more upper-case words.
For questions classified as sincere we see general words like “will”, “one”, and so on. We also see the word “will” prevalent in insincere questions; during the data processing steps we will have to treat such common words. Another point brought out in the word cloud is how words like “Trump” and “liberal” are very specific to insincere questions, possibly because the person is making a statement about these topics rather than genuinely looking for an answer.
Unstructured text data is usually dirty: it has misspelled words, inconsistently cased words, and various other issues. We need to clean the text and bring it to a standardized form before extracting information from it; without this step, the noise will result in a poor model.
Broadly, consider the following steps:
Tokenization refers to the splitting of strings of text into smaller chunks or tokens. Paragraphs or large bodies of text are tokenized into sentences and then sentences are broken down into individual words.
This refers to a series of steps that transforms the corpus of text into a single standard and consistent form. The following steps are a part of this process:
Stemming, which involves chopping off the end of a word or inflectional endings (-ing, -ed etc.) to get its root form or stem, using crude heuristic rules.
burning -> burn.
Stemming works well most of the time, but can return words that do not look correct intuitively.
difficulties -> difficulti
Lemmatization has the same goal as stemming. However, it uses a vocabulary and the morphological analysis of words, to remove inflectional endings and return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization aims to reduce the word properly so that it makes sense according to the language.
ran -> run, difficulties -> difficulty
The idea for stemming or lemmatization of words is to reduce words into a common form. For example, difficulties and difficulty will portray the same intent and context.
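The normalization idea above can be sketched with a deliberately crude suffix-stripping stemmer; real stemmers such as NLTK's PorterStemmer use a much richer set of heuristic rules, and the suffix list below is invented purely for illustration.

```python
def crude_stem(word):
    # Strip a few common inflectional endings, longest first, keeping at
    # least a 3-character stem so very short words are left alone.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # a lemmatizer-style touch: "difficulties" -> "difficulty"
            return stem + "y" if suffix == "ies" else stem
    return word

print([crude_stem(w) for w in ["burning", "difficulties", "walked", "runs"]])
# -> ['burn', 'difficulty', 'walk', 'run']
```

Even these few rules collapse inflected variants onto a common form, which is exactly what we want before counting words or building vectorizer features.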
For our use case we have performed the following operations to clean the data (using the library NLTK):
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# lower case
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Removing punctuation
all_data['question_text'] = all_data['question_text'].str.replace('[^\w\s]', '')

# Removing numbers
all_data['question_text'] = all_data['question_text'].str.replace('[0-9]', '')

# Removing stop words and words with length <= 2
stop = stopwords.words('english')
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop and len(x) > 2))

# Lemmatize
wl = WordNetLemmatizer()
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(wl.lemmatize(x, 'v') for x in x.split()))
This part is what makes the difference between a good and a bad solution in any ML project. So what features can we create for our use case? We can start with sentiment.
Sentiment analysis is a part of opinion mining and involves building a system to extract the opinion from a text. That is, we wish to get a score indicating how positive or negative the text is.
The assumption with respect to our dataset was that questions flagged as insincere may contain toxic content and would exhibit a negative sentiment. However, as a modeling feature, sentiment turned out to be weak. On deeper evaluation, we noticed several questions with high polarity scores under both the insincere and the sincere tags.
Topic modelling is an approach to identify topics present across a corpus of text. A topic is defined as a repeating pattern of co-occurring terms in a corpus. A document contains multiple topics in varying proportions. So, for example, a document based on healthcare is more likely to contain a higher ratio of words like “doctor” and “surgery” than words such as “brakes” and “gear”, which indicate a theme of automobiles.
Using a technique like Latent Dirichlet Allocation to get the distribution of topics across the corpus would potentially help to get a sense of the themes discussed in the set of questions. Further, we hypothesize that there would be some difference between the topics of sincere and insincere questions.
The image below shows the distribution of the topics with respect to the different classes by taking an average.
Top words from the top topics of the insincere class:
Top words from the top topics of the sincere class:
Countvectorizer returns a matrix showing the frequency of each term in the vocabulary per document. On the other hand, tf-idf (term frequency–inverse document frequency) evaluates how important a word is to a document in the corpus.
tf(x) = (Number of times term x appears in a document) / (Total number of terms in the document)
idf(x) = log(Total number of documents / Number of documents with term x in it)
tf-idf(x) = tf(x) * idf(x)
Clearly, the importance of a word in a document increases proportionally to the number of times it appears there, but it is offset by the number of times it occurs across the corpus.
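The formulas above can be computed by hand for a toy corpus (the three "questions" below are invented for illustration):

```python
import math

docs = [["quora", "ban", "people"],
        ["quora", "question", "answer"],
        ["ban", "ban", "people"]]

def tf(term, doc):
    # term frequency within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency across the corpus
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "ban" appears twice in the third question and in 2 of the 3 documents.
print(round(tf_idf("ban", docs[2], docs), 4))
```

A word that appears in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing, which is exactly the offsetting effect described above.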
Both tf-idf and countvectorizer, as features, may indicate the relevance of a certain set of words to questions labelled as “Sincere” as well as “Insincere”.
The image below is obtained by using a tf-idf vectorizer to create features and a k-fold CV logistic regression model; it shows the words (of insincere questions) with the most weight.
The idea behind building features such as the number of unique words, characters or exclamation points is to check for uniformity in the data set. We wish to observe if there are some similarities between the train and test set. Some questions that these meta features help answer include:
The examples mentioned above give us the idea that there might be certain patterns specific to the respective classes that can be leveraged in our model. To give an ad hoc example of how useful meta features can be, on a musical note, the number of words per minute for Eminem is different based on the content/emotion of the song.
Some of the meta features are listed below:
We have added some box plots in the data exploration section to provide you with an idea regarding the prevalent distribution with respect to the different classes.
So far we have cleaned up our text and carried out feature engineering. There are several ways to select the relevant features; however, for the purpose of this article we decided to generate a separate model for each set of features, as this helps develop a general understanding and makes these tactics easier to reuse on other text classification datasets.
A few things to note:
There are two pieces of code that will be reused in most of the models:
kf = KFold(n_splits=5, shuffle=True, random_state=43)

## Initialize 0's
test_pred_ots = 0
oof_pred_ots = np.zeros([train.shape[0],])
train_target = train['target'].values
x_test = test[selected_features].values

## Loop to split the data set
for i, (train_index, val_index) in tqdm(enumerate(kf.split(train))):
    x_train, x_val = train.loc[train_index][selected_features].values, train.loc[val_index][selected_features].values
    y_train, y_val = train_target[train_index], train_target[val_index]

    # Model
    classifier = LogisticRegression(C=0.1)
    classifier.fit(x_train, y_train)

    ## Validation set predictions
    val_preds = classifier.predict_proba(x_val)[:, 1]

    ## Test set predictions
    preds = classifier.predict_proba(x_test)[:, 1]
    test_pred_ots += 0.2 * preds
    oof_pred_ots[val_index] = val_preds

print("--- %s seconds for Model Selected Features ---" % (time.time() - start_time))
The code above runs 5-fold cross validation; with each split we train and make predictions on the validation and test datasets. At the end of all splits we get oof_pred_ots, the out-of-fold predictions on the validation sets combined into a single data frame, and test_pred_ots, the average of the test-set prediction probabilities across splits.
thresh_opt_ots = 0.5
f1_opt = 0
for thresh in np.arange(0.1, 0.91, 0.01):
    thresh = np.round(thresh, 2)
    f1 = metrics.f1_score(train_target, (oof_pred_ots.astype(float) > thresh).astype(int))
    # print("F1 score at threshold {0} is {1}".format(thresh, f1))
    if f1_opt < f1:
        f1_opt = f1
        thresh_opt_ots = thresh

print(thresh_opt_ots)
pred_train_ots = (oof_pred_ots > thresh_opt_ots).astype(int)
f1_score(train_target, pred_train_ots)
The code above will help find the best threshold.
First Model:
We used the text descriptive features and ran a 5-fold cross-validation logistic regression model; however, the F1 score was not significant (0.27).
Second Model:
We used the sentiment and topic modeling features and ran the same model as before. This time we got a better score (0.34).
Third Model:
We used tf-idf features and tried logistic regression (F1: 0.587) and LightGBM (F1: 0.591). This is much better.
Fourth Model:
We used countvectorizer features and tried logistic regression (F1: 0.592), multinomial naive Bayes (F1: 0.55), and Bernoulli naive Bayes (F1: 0.53).
The idea here is that one model might observe patterns that another doesn’t. Further, an ensemble helps get better results while reducing the chance of overfitting. We used stacking, which means we make predictions on the entire train set. This is accomplished by splitting the data at each fold into a train and a holdout set and making predictions on the holdout set, such that there is a prediction for every row in the train dataset.
We use these new predictions from the respective models as input variables and run another (logistic regression) model on top of this giving us the final probabilities.
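A minimal sketch of the stacking step, with made-up out-of-fold probabilities standing in for our base models, and simple averaging standing in for the level-2 logistic regression:

```python
import numpy as np

y = np.array([0, 1, 0, 1, 0, 1])  # true labels (invented)
oof_model_a = np.array([0.2, 0.7, 0.1, 0.8, 0.3, 0.6])  # e.g. tf-idf model
oof_model_b = np.array([0.3, 0.6, 0.2, 0.9, 0.1, 0.7])  # e.g. countvectorizer model

# Meta-features: one column per base model, one row per train example.
meta_X = np.column_stack([oof_model_a, oof_model_b])

# A trivial meta-model: average the base probabilities and threshold.
final_probs = meta_X.mean(axis=1)
final_preds = (final_probs > 0.5).astype(int)
print(final_preds)
```

The crucial point is that `meta_X` contains only out-of-fold predictions, so the meta-model never sees a base model's prediction on an example that model was trained on, which is what keeps stacking from overfitting.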
Our final F1 score was 0.604, and on the leaderboard we scored 0.589.
In the next article we will implement a deep learning approach to the same use case and draw comparisons between the two methodologies.
References:
The post Mechanics of Deep Learning appeared first on Data Science Discovery.
Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work.
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
We have already covered some of the basics of the architecture and the respective components in the previous posts. But we need to understand one of the most important concepts.
How do Neural networks exactly work?
How are the weights updated in Neural networks?
Well, let’s get into the algorithms behind Neural Networks.
For most machine learning algorithms, optimization is used to minimize the cost/error function. Gradient Descent is one of the most popular optimization algorithms used in Machine Learning. There are many powerful ML algorithms that use gradient descent such as linear regression, logistic regression, support vector machine (SVM) and neural networks.
Intuition
Let’s take the classic mountain-valley example with a twist. You meet a pirate, and in your travels you discover a map to the golden chalice of wisdom. The secret location is the lowest point in a very dark and deep valley. Given that there are no possible sources of natural or artificial light in this magical valley, the pirate and you are in a race to reach the bottom of the valley in pitch darkness. The pirate decides to step forward randomly, hoping to eventually reach the lowest point.
Both of you have the same starting point, but you think there must be a smarter way. At every step you feel the gradient (slope) around you and take the steepest step possible. By taking the best possible step every time, you win!
That is analogous to the gradient descent technique. We are operating in the blind trying to take a step in the most optimal direction.
Let us say that we fit a regression model on our dataset. We need a cost function to minimize the error between our prediction and the actual value. The plot of our cost function will look like:
Gradient is another word for slope and the first step in gradient descent is to pick a starting value at random or set it to 0. Now, a gradient has the following characteristics:
Let’s take a mathematical function to understand this further.
In mathematical terms, if our function is:
$f(x) = e^{2}\sin(x)$
The derivative:
$\frac{\partial f}{\partial x} = e^{2}\cos(x)$
If x = 0
$\frac{\partial f}{\partial x}(0) = e^{2} \approx 7.4$
So when you start at 0 and move a little (take a step), the function changes by about 7.4 times (magnitude) the amount that you changed. Similarly, if you have multiple variables we take partial derivatives:
$z = f(x,y) = xy + x^2$
For a function such as the one above, we first treat y as a constant and differentiate with respect to x (here: y + 2x). Then we treat x as a constant and differentiate with respect to y (here: x). If x = 3 and y = -3, then f(x, y) = 3(-3) + 3^2 = 0, and the partial derivatives evaluate to y + 2x = 3 and x = 3.
The gradient $\nabla f$ points in the direction of greatest increase of the function; gradient descent moves in the opposite direction.
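Both derivative calculations above can be sanity-checked numerically with a central-difference approximation. A minimal sketch (the helper name `central_diff` is ours, not from the text):

```python
import math

def central_diff(f, x, h=1e-6):
    # Numeric derivative: (f(x+h) - f(x-h)) / 2h
    return (f(x + h) - f(x - h)) / (2 * h)

# Single variable: f(x) = e^2 * sin(x), analytic derivative e^2 * cos(x)
f = lambda x: math.e ** 2 * math.sin(x)
print(round(central_diff(f, 0.0), 3))  # ≈ e^2 ≈ 7.389

# Two variables: z = x*y + x^2, partials y + 2x and x
z = lambda x, y: x * y + x ** 2
x0, y0 = 3.0, -3.0
dz_dx = central_diff(lambda x: z(x, y0), x0)  # y + 2x = -3 + 6 = 3
dz_dy = central_diff(lambda y: z(x0, y), y0)  # x = 3
print(round(dz_dx, 3), round(dz_dy, 3))
```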
In a feed-forward network, we are learning how the error varies as the weights are adjusted. The relationship between the net’s error and a single weight will look something like the image below (we will get into more detail a little later):
As a neural network learns, it slowly adjusts several weights by calculating (dE/dw) the derivative of network Error with respect to the weights.
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point. If you pick a learning rate that is too small, learning will take too long and if you keep a very large learning rate the algorithm might diverge away from the minimum point (miss the minimum completely).
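The effect of the learning rate can be demonstrated on the toy function f(x) = x², whose minimum is at 0 (the specific rates below are illustrative, not from the text):

```python
def grad(x):
    # Derivative of f(x) = x^2
    return 2 * x

def descend(lr, steps=50, x=5.0):
    # Plain gradient descent: step against the gradient, scaled by lr
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

print(abs(descend(0.01)))  # too small: after 50 steps, still far from 0
print(abs(descend(0.1)))   # reasonable: very close to the minimum
print(abs(descend(1.1)))   # too large: each step overshoots and diverges
```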
Finally, the weights are updated incrementally after each epoch (pass over the training dataset) till we get the best results.
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we have assumed that the batch has been the entire data set. But for large datasets, the gradient computation might be expensive.
Stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the full gradient ∇f(x), stochastic gradient descent samples an index i uniformly at random and computes ∇fi(x) instead.
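A minimal sketch of that sampling step, minimizing f(x) = (1/n) Σᵢ (x − aᵢ)² over a made-up list of values (whose minimizer is their mean):

```python
import random

data = [2.0, 4.0, 6.0, 8.0]  # made-up values; the minimizer is their mean, 5.0

def grad_i(x, i):
    # Gradient of the i-th term f_i(x) = (x - a_i)^2
    return 2 * (x - data[i])

random.seed(0)
x, lr = 0.0, 0.01
for _ in range(2000):
    i = random.randrange(len(data))  # sample i uniformly at random
    x = x - lr * grad_i(x, i)        # step using only f_i's gradient

print(x)  # noisy, but hovers near the mean, 5.0
```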
Back-propagation is simply a technique for updating the weights. We are aware of partial derivatives, the chain rule and, most importantly, gradient descent. But neural networks with multiple layers and different activation functions make it difficult to visualize how everything comes together. Consider a simple example with the following architecture:
Step 1: Initialization. Let us initialize the weights and the bias.

Table 1a: Weight Initialization Example
Weights | Value |
---|---|
w1 | 0.10 |
w2 | 0.15 |
w3 | 0.03 |
w4 | 0.08 |
w5 | 0.18 |
w6 | 0.06 |
w7 | 0.11 |
w8 | 0.26 |
Table 1b: Bias Initialization Example
Bias | Value |
---|---|
b1 | 0.05 |
b2 | 0.42 |
Take the initial input values to be [0.95, 0.06] and the target values to be [0.05, 0.82].
Step 2: Calculations
To get the value of H1:
H1 = w1 * x1 + w2 * x2 + b1 = 0.1 * 0.95 + 0.15 * 0.06 + 0.05 = 0.154
As we have a sigmoid activation function:
$
\sigma(X) = \frac{1}{1+e^{-X}}
$
$
H1 = \sigma(0.154) = \frac{1}{1+e^{-0.154}} = 0.538
$
Similarly, we can calculate H2.
H1 = 0.538 and H2 = 0.52
Now we calculate the value for output nodes Y1 and Y2.
Y1 = w5 * H1 + w6 * H2 + b2 = 0.18 * 0.538 + 0.06 * 0.52 + 0.42 = 0.548
$
Y1 = \frac{1}{1+e^{-Y1}} = \frac{1}{1+e^{-0.548}} = 0.633
$
Upon calculation:
Y1 = 0.633 & Y2 = 0.648
Step 3: Error Function. Let the error function be:
$
J( \theta ) = \frac{1}{2}( target - output )^2
$
E1 = 0.5 * (0.05 - 0.63368)^2 = 0.1703
E2 = 0.5 * (0.82 - 0.64893)^2 = 0.0146
Total Error (E) = E1 + E2 = 0.184972
Back-propagate the Errors to update the weights.
Error at W5:
$
\partial E \over \partial W5
$
$
= ({\partial E \over \partial output Y1}) * ({\partial output Y1 \over \partial Y1}) * ({\partial Y1 \over \partial W5})
$
Component 1: The Cost/Error Function
target: T, output: out
E = 0.5 * (T1 - out Y1)^2 + 0.5 * (T2 - out Y2)^2
Differentiating with respect to out Y1: -(T1 - out Y1) = -(0.05 - 0.63368) = 0.58368
Component 2: The Activation function
out Y1 = 1 / (1 + exp(-Y1))
Differentiating: out Y1 * (1 - out Y1) = 0.63368 * (1 - 0.63368) = 0.23213
Component 3: The Function of Weights
Y1 = w5 * H1 + w6 * H2 + b2
Differentiating with respect to w5: H1 = 0.538
Finally, we have the change in W5:
$ \partial E \over \partial W5
$
=0.58368∗0.23213∗0.538
=0.07289
In order to update W5, recall the discussion on gradient descent: we step in the direction opposite to the gradient. Let alpha be the learning rate with a chosen value of 0.01.
Updated W5 will be:
$
W5 - \alpha * ({\partial E \over \partial W5})
$
= 0.18 - 0.01 * 0.07289
= 0.1792711
Similarly, we can update the remaining weights. Let’s have a look at the formula to update W1:
$
\frac{\partial E}{\partial w1} = (\sum\limits_{i}{\frac{\partial E}{\partial out_{i}} * \frac{\partial out_{i}}{\partial Y_{i}} * \frac{\partial Y_{i}}{\partial out_{h1}}}) * \frac{\partial out_{h1}}{\partial H1} * \frac{\partial H1}{\partial w_{1}}
$
It looks complicated, but really we are just going back layer by layer to get each value. w1 feeds into neuron H1, and H1 is connected to both Y1 and Y2. Moving backwards, we differentiate the error function, then Y1 and Y2 (their activation functions and their functions of weights). That leads us to H1, where we differentiate its activation function and its function of weights.
This is how we back-propagate the errors and update all the weights. Once we have updated all the weights, that is one epoch, or pass over the dataset. Then we start the entire process of forward pass and backward pass again, repeating it multiple times with the purpose of minimizing the error.
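The worked example above can be checked in a few lines of Python. This sketch reproduces the forward pass and the gradient for W5 (small rounding differences from the hand calculation are expected):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Values from the worked example above
x1, x2 = 0.95, 0.06
t1, t2 = 0.05, 0.82
w1, w2, w3, w4 = 0.10, 0.15, 0.03, 0.08
w5, w6, w7, w8 = 0.18, 0.06, 0.11, 0.26
b1, b2 = 0.05, 0.42

# Forward pass
h1 = sigmoid(w1 * x1 + w2 * x2 + b1)  # ≈ 0.538
h2 = sigmoid(w3 * x1 + w4 * x2 + b1)  # ≈ 0.521
y1 = sigmoid(w5 * h1 + w6 * h2 + b2)  # ≈ 0.634
y2 = sigmoid(w7 * h1 + w8 * h2 + b2)  # ≈ 0.649

e1 = 0.5 * (t1 - y1) ** 2
e2 = 0.5 * (t2 - y2) ** 2
print(e1 + e2)  # total error ≈ 0.185

# Backward pass for w5: the three chain-rule components
d_e_d_out = -(t1 - y1)       # ≈ 0.58368
d_out_d_y = y1 * (1 - y1)    # ≈ 0.23213
d_y_d_w5 = h1                # ≈ 0.538
grad_w5 = d_e_d_out * d_out_d_y * d_y_d_w5
print(grad_w5)               # ≈ 0.0729
```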
When do we stop?
We stop prior to over-fitting: we want the minimum validation error, but we do not want the training error to drop far below the validation error.
Hopefully, this explains the entire process of how neural networks actually work and sheds some light on gradient descent and back-propagation.
Activation: We have talked about activation functions in the past posts, but let’s understand in more detail the different types of activation functions and explore their characteristics.
The post Mechanics of Deep Learning appeared first on Data Science Discovery.
The post Deep Learning Architecture appeared first on Data Science Discovery.
What are Neural Networks made of? Understanding the different components and the architecture of Neural Networks.
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
The introduction to neural networks and a general idea behind the inspiration for such an algorithm has been discussed in the previous post. We will talk about the building blocks of neural networks in detail in future posts, but in this post we focus on the overall structure of Neural networks and discuss some of the components.
Now, let’s briefly discuss the elements of a neural network.
Neural networks are a set of algorithms, inspired by the working of the human brain. These algorithms are designed to recognize patterns. Neural networks consist of layers which are made of nodes. These nodes are where all the calculations happen.
Each input has its own relative weight. Weights are adaptive coefficients that determine the intensity of the input signal as registered by the artificial neuron. Using techniques like back-propagation discussed here, the weights are updated with each iteration in order to reduce the error. For now, all we need to know is that the weights will be updated using special algorithms and that these algorithms require differentiation. So the weights will be updated over time, but when we start training a neural network, how do we initialize the weights?
He-et-al Initialization: In this method, the weights are initialized while keeping in mind the size of the previous layer, that is, the number of neurons in the previous layer. This helps the cost function converge faster. The weights are still random but differ in range, which makes the initialization more controlled. More details about this technique are available here.
There are several techniques that can be used for initialization, but the ones mentioned here should give you some idea of how weights fit into neural networks as a component.
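A minimal numpy sketch of the He-et-al idea, scaling the random weights by the size of the previous layer (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He-et-al initialization: zero-mean Gaussian with variance 2 / n_in,
    # where n_in is the number of neurons in the previous layer
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

w = he_init(512, 256)
print(w.shape, float(w.std()))  # std close to sqrt(2/512) ≈ 0.0625
```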
A neural network is the grouping of neurons into layers and there can be many layers between the input and output layer. Most applications require networks that contain at least the three layers – input, hidden, and output. Each neuron in the hidden layer will be connected to all the neurons in the previous layer. We can start with these two types of basic perceptrons. They feed information from the front to the back and therefore are called Feed Forward networks.
Single Layer Feed Forward Neural Network consists of a single layer, that is, it has only the input and output layer. A single-layer perceptron can only be used for very simple problems, such as classifying classes separable by a line.
Multi Layer Feed Forward Neural Network consists of one or more hidden layers, whose computation nodes are called hidden neurons or hidden units. A multilayer perceptron can represent convex regions and can thus separate classes in some high-dimensional space.
Now, we know it makes sense to have multiple layers especially when dealing with images or complex data.
How do we decide on what architecture to use? How many hidden layers should be used?
There has to be a trade-off and there is no definite answer to this question. However, I can suggest you the following:
Experimentation: Find out what works best for your data given the computational constraints.
Intuition or Google: Based on experience with past models, you can come up with an answer. If you have a standard DL problem such as image classification, you can Google what others have used (ResNet, VGG, and so on).
Search: Try random or grid search for different architectures and choose the one giving the best score.
There are several different architectures shown in the image below. To summarize what are the parameters that govern or define the architecture:
Inside the Black box: What is going on inside this Black box algorithm? Trying to build intuition and understanding of what is going on in the different layers of a neural network. Let’s continue with the learning in this next article where we take a closer look at what happens with the different neurons and respective layers of a Neural Network.
The post Deep Learning Architecture appeared first on Data Science Discovery.
The post Deep Learning Invasive Species appeared first on Data Science Discovery.
Don’t get alarmed: we are going to put what we have learnt into practice on a playground Kaggle dataset, explaining the code along the way.
This was coded sometime back and utilizes the library Fastai version 0.7, however recently there have been some updates in the library and new releases in pytorch as well. The current code will no longer work with Fastai v1, while there are still some important concepts that can be learned from this code such as:
We have covered some basic concepts regarding what neural networks are and how do they work. However, I feel it has been too much theory and while learning any new concept it is also important to see that theory in action. Let’s start!!!
Let’s pick up a playground problem from Kaggle. Invasive species can have damaging effects on the environment, the economy, and even human health. Consider the tangles of kudzu that overwhelm trees in Georgia, while cane toads threaten habitats in over a dozen countries worldwide. This means it is very important to track and stop the spread of these invasive species. Think of how costly and difficult it would be to undertake this task at a large scale. Trained scientists would be required to visit designated areas and take note of the species inhabiting them. Using such a highly qualified workforce is expensive, time inefficient, and insufficient, since humans cannot cover large areas when sampling.
Looks like a very interesting use case for Deep Learning.
What we need is a labeled dataset of images marked as invasive or safe; our algorithm will take care of the rest. You can start a kernel (a Python Jupyter notebook) using this link and follow along. A few settings to keep in mind: make sure that you have GPU and internet enabled. There are several libraries in Python for deep learning; however, we will use fastai.
The full code is available here.
Let’s start coding!!!
```python
# Get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline
```
Just some basic setup commands: autoreload reloads modules automatically before executing code, and %matplotlib inline is a magic command that renders plots inline in the notebook.
```python
### Import Required Libraries
# Using Fastai Libraries
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
import numpy as np
import pandas as pd
import torch
import os

PATH = "../input"
print(os.listdir(PATH))
TMP_PATH = "/tmp/tmp"
MODEL_PATH = "/tmp/model/"
sz = 224
bs = 58
arch = resnet34
```
Defining some variables:
I know in this series we have not yet covered the convolution operation or, in particular, how CNNs work. However, for now all we need to know is that a CNN is a type of neural network popular for image classification and ResNet is a type of architecture. ResNet-34 has 34 layers!
The programming framework used behind the scenes to work with NVIDIA GPUs is called CUDA. Further, to improve performance, we check for the NVIDIA package called CuDNN (specially accelerated functions for deep learning).
```python
### Checking GPU setup
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)
```
Both of these should be true.
Now let’s look at what form the data is in: we need to understand how the data directories are structured, what the labels are, and what some sample images look like. An f-string (f'...') is a convenient way to build a path string.
```python
files = os.listdir(f'{PATH}/train')[:5]  # train contains image names
print(files)
img = plt.imread(f'{PATH}/train/{files[0]}')
plt.imshow(img)
print(img.shape)
```
We get the height, width and channels using img.shape. img is a 3-dimensional array, so img[:4,:4] gives us the red, green and blue values for the top-left pixels. Now, let’s split the data into train and validation sets.
```python
label_csv = f'{PATH}/train_labels.csv'
n = len(list(open(label_csv))) - 1  # header is not counted (-1)
val_idxs = get_cv_idxs(n)  # random 20% of data for validation set
print(n)              # total dataset size
print(len(val_idxs))  # validation dataset size
```
```python
label_df = pd.read_csv(label_csv)
### Count of both classes
label_df.pivot_table(index="invasive", aggfunc=len).sort_values('name', ascending=False)
```
The label CSV contains the name and the corresponding label (1 or 0), where 1 means it has an invasive tag.

Table 1: Target Variable Distribution
Label | Count |
---|---|
1 | 1448 |
0 | 847 |
```python
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                    test_name='test', val_idxs=val_idxs,
                                    suffix='.jpg', tfms=tfms, bs=bs)
```
tfms stands for transformations. tfms_from_model takes care of resizing, image cropping, initial normalization and more. A pre-defined list of transformations is applied by transforms_side_on. We can also specify random zooming of images up to a specified scale by adding the max_zoom parameter.
With ImageClassifierData.from_csv we are just putting together everything (train, validation set, the labels and batch size).
```python
fn = f'{PATH}/train' + data.trn_ds.fnames[0]
# img = PIL.Image.open(fn)
size_d = {k: PIL.Image.open(f'{PATH}/' + k).size for k in data.trn_ds.fnames}
row_sz, col_sz = list(zip(*size_d.values()))
row_sz = np.array(row_sz)
col_sz = np.array(col_sz)
plt.hist(row_sz)
```
A plot of the distribution of the size of the images. Ideally, we want all images to have a standard size to allow easier computation.
Our first model: To make the process quick, we will first run a pre-trained model and observe the results; then we can tweak the model for improvements. A pre-trained model is a model created by someone else to solve a different problem, with the weights saved after training on their dataset. Instead of coming up with our own weights specific to our dataset, we will use their weights as-is. This is what we call transfer learning.
Is that a good idea?
Well, usually these weights are attained by training on a very large dataset, for example ImageNet. It helps speed up your training process.
We have a train set with 1836 images and a test set with 1531, which is not much for attaining a high-accuracy model with weights trained from scratch. Further, in the article regarding the black box, we observed how gradients and edges are found in the initial layers of a neural network. That is useful information for our use case as well.
Let us form a function to get the data and resize images if necessary.
```python
def get_data(sz, bs):  # sz: image size, bs: batch size
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                        test_name='test', val_idxs=val_idxs,
                                        suffix='.jpg', tfms=tfms, bs=bs)
    # Reading and resizing big jpgs is slow, so resizing them all
    # to a standard size first saves time
    return data if sz > 500 else data.resize(512, TMP_PATH)
```
```python
data = get_data(sz, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True,
                               tmp_name=TMP_PATH, models_name=MODEL_PATH)
learn.fit(1e-2, 3)
```
ConvLearner.pretrained builds a learner that contains a pre-trained model. The last layer of the model needs to be replaced with a layer of the right dimensions. The pre-trained model was trained for 1000 classes, therefore the final layer predicts a vector of 1000 probabilities. However, what we need is only a two-dimensional vector. The diagram below shows an example of how this was done in one of the earliest successful CNNs. The layer “FC8” here would get replaced with a new layer with 2 outputs.
Parameters are learned by fitting a model to the data. Hyperparameters are another kind of parameter, that cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. In learn.fit we provide the learning rate and the number of epochs (times we pass over the complete dataset).
The output of learn.fit is:

Table 2: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.379021 | 0.196531 | 0.932462 |
1 | 0.285149 | 0.168239 | 0.947712 |
2 | 0.229199 | 0.14343 | 0.947712 |
94% accuracy on our first model!!!
Let’s write some functions to try to understand what the model is getting correct and wrong. We will explore:
```python
# This gives predictions for the validation set. Predictions are in log scale
log_preds = learn.predict()
print(log_preds.shape)
preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
probs = np.exp(log_preds[:,1])        # pr(1); Species = Invasive is class 1

def rand_by_mask(mask):
    return np.random.choice(np.where(mask)[0], min(len(preds), 4), replace=False)

def rand_by_correct(is_correct):
    return rand_by_mask((preds == data.val_y) == is_correct)

def plots(ims, figsize=(12,6), rows=1, titles=None):
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        if titles is not None:
            sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i])

def load_img_id(ds, idx):
    return np.array(PIL.Image.open(f'{PATH}/' + ds.fnames[idx]))

def plot_val_with_title(idxs, title):
    imgs = [load_img_id(data.val_ds, x) for x in idxs]
    title_probs = [probs[x] for x in idxs]
    print(title)
    return plots(imgs, rows=1, titles=title_probs, figsize=(16,8)) if len(imgs) > 0 else print('Not Found.')

def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]

def most_by_correct(y, is_correct):
    mult = -1 if (y == 1) == is_correct else 1
    return most_by_mask(((preds == data.val_y) == is_correct) & (data.val_y == y), mult)
```
Let’s take a look at what we get if we were to call these functions. Keep in mind our classification threshold is 0.5.
```python
# 1. A few correct labels at random
plot_val_with_title(rand_by_correct(True), "Correctly classified")

# 2. A few incorrect labels at random
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

# Most correct classifications: Class 0
plot_val_with_title(most_by_correct(0, True), "Most correct classifications: Class 0")

# Most correct classifications: Class 1
plot_val_with_title(most_by_correct(1, True), "Most correct classifications: Class 1")

# Most incorrect classifications: Actual Class 0, Predicted Class 1
plot_val_with_title(most_by_correct(0, False), "Most incorrect classifications: Actual Class 0 Predicted Class 1")

# Most incorrect classifications: Actual Class 1, Predicted Class 0
plot_val_with_title(most_by_correct(1, False), "Most incorrect classifications: Actual Class 1 Predicted Class 0")

# Most uncertain predictions
most_uncertain = np.argsort(np.abs(probs - 0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")
```
Scope of Improvement:
```python
## How does loss change with changes in learning rate (for the last layer)?
learn.lr_find()
learn.sched.plot_lr()
```
The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing.
```python
# Note that the loss still clearly improves until lr=1e-2 (0.01).
# The LR can vary as a part of stochastic gradient descent over time.
learn.sched.plot()
```
We can see the plot of loss versus learning rate to see where our loss stops decreasing:
Now we have an idea of how to select our learning rate. To set the number of epochs, we just need to ensure that there is no over-fitting. Let’s talk about data augmentation.
Data augmentation is a good step to prevent over-fitting. That is, by cropping/zooming/rotating the image, we can ensure that the model does not learn patterns specific to the train data and generalizes well to new data.
```python
def get_augs():
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                        bs=2, tfms=tfms, suffix='.jpg',
                                        val_idxs=val_idxs, test_name='test')
    x, _ = next(iter(data.aug_dl))
    return data.trn_ds.denorm(x)[1]

# An example of data augmentation
ims = np.stack([get_augs() for i in range(6)])
plots(ims, rows=2)
```
With precompute=True, all layers of the neural network are frozen except the last layer, so we are only updating the weights of the last layer with our dataset. Now, we will train the model with precompute set to False and cycle_len enabled. Cycle length uses a technique called stochastic gradient descent with restarts (SGDR), a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. In other words, SGDR reduces the learning rate every mini-batch, and a reset occurs every cycle_len epochs. This is helpful because as we get closer to the optimal weights, we want to take smaller steps.
```python
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)
```
Table 3: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.221001 | 0.1623 | 0.943355 |
1 | 0.232999 | 0.179043 | 0.941176 |
2 | 0.224435 | 0.148815 | 0.947712 |
Calling learn.sched.plot_lr() once again:
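The shape of that plot, with the learning rate annealed within each cycle and reset at every restart, can be sketched as cosine annealing (a simplified illustration, not fastai's exact schedule):

```python
import math

def sgdr_lr(max_lr, iters_per_cycle, t):
    # Cosine-annealed learning rate within a cycle; the modulo resets
    # the schedule at every cycle boundary (the "restart")
    pos = (t % iters_per_cycle) / iters_per_cycle
    return 0.5 * max_lr * (1 + math.cos(math.pi * pos))

lrs = [sgdr_lr(0.01, 100, t) for t in range(300)]
print(lrs[0], lrs[99], lrs[100])  # max, near zero, then back to max
```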
To unfreeze the layers, we call unfreeze. We will also try differential learning rates for the respective layers.
```python
learn.unfreeze()
lr = np.array([1e-4, 1e-3, 1e-2])
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
```
Table 4: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.323539 | 0.178492 | 0.923747 |
1 | 0.247502 | 0.132352 | 0.949891 |
2 | 0.192528 | 0.128903 | 0.954248 |
3 | 0.165231 | 0.101978 | 0.962963 |
4 | 0.141049 | 0.106319 | 0.960784 |
5 | 0.121947 | 0.103018 | 0.960784 |
6 | 0.107445 | 0.100944 | 0.965142 |
Improved our model, 96.5% accuracy…
Above, we set the learning rate of the final layers. The learning rates of the earlier layers are fixed at the same multiples of the final layer rates as we initially requested (i.e. the first layers have 100x smaller and the middle layers 10x smaller learning rates, since we set lr=np.array([1e-4,1e-3,1e-2])).
To get a better picture, we can use Test time augmentation (learn.TTA()), that is we use data augmentation techniques on our validation set. Thus, by making predictions on both the validation set images and their augmented images, we will be more accurate.
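Conceptually, TTA just averages the model's predictions over the original image and a few augmented copies. A toy sketch (the model and augmentation below are stand-ins, not fastai's API):

```python
import numpy as np

def tta_predict(model, image, augment, n_aug=4):
    # Predict on the original image plus n_aug augmented copies,
    # then average the predicted probabilities
    preds = [model(image)] + [model(augment(image)) for _ in range(n_aug)]
    return np.mean(preds, axis=0)

# Stand-ins: a "model" that returns two class probabilities,
# and a horizontal flip as the augmentation
toy_model = lambda img: np.array([img.mean(), 1 - img.mean()])
hflip = lambda img: img[:, ::-1]

img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(tta_predict(toy_model, img, hflip))  # averaged class probabilities
```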
Our confusion matrix:
Our final accuracy was 96.73% and upon submission to the public leader-board we got 98%.
Data Exploration:
Models Tweaking:
The post Deep Learning Invasive Species appeared first on Data Science Discovery.
The post Deep Learning Black box appeared first on Data Science Discovery.
Let’s try to take a sneak peek inside the black box of deep learning and build some intuition along the way.
We have covered the origins and understood a little bit about the structure of neural networks in the previous articles. However, before we further dive into the math behind the working of neural networks, we need to polish our understanding of what is going on inside the black box.
Deep learning algorithms are mostly a black box. We do not know what patterns are being observed that trigger an activation function. We can make a guess for example when it classifies a “Dog” in Cats vs Dogs dataset, it probably saw the ears or the shape of the dog’s face. But this uncertainty would not work when these algorithms are being used in self driving cars. In such use cases we need to know why the algorithm is working the way it is.
In neural networks it is not necessary that a neuron fires for every image. That is, a neuron will be activated only for select features present in the input images.
Well, some light was shed on feature visualization by Matthew D. Zeiler and Rob Fergus in a research paper that is available here.
They put together a novel method to decode these features. First, they trained a normal CNN (convolutional neural network, a type of neural network) to classify images. At the same time, they also trained a backward-looking network.
To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
Suppose we pass image A into our CNN, it passes through layer K, and only neuron M ends up being activated. The backward-looking network (deconvnet) is then used to reconstruct the state of the previous step: based on the output of layer K, with all activations except M set to 0, we revert the activity that happened in the layer beneath.
Thus, our goal here is to understand what feature activates a neuron. Let’s say we are training our network on cats vs dogs dataset, now we start focusing on a single neuron and ignoring all other neurons with the purpose of understanding what activated that neuron. Maybe the dog’s ears are the defining feature for this neuron and that is what this neuron looks for in every image. Now using this methodology we can explore how things are working layer by layer. In the images below you will notice the actual images and what the neural network is observing:
In layer 1, the CNN is able to identify color gradients, and as we look at deeper layers, more complex patterns. The patterns emerge going from gradients to edges/shapes to complex features like eyes.
It was necessary to supply several images to the network to see what activates the selected neuron, which becomes computationally intensive. There is another way: what if we supply an image created with random pixels and try to find out what would excite one particular neuron? We use an image similar to the one shared below and run it through a neural network with only one neuron activated. The network then figures out how to change the color of each pixel to increase the activation of that neuron. More information is available here in the paper by Jason Yosinski.
So what do these activation-maximizing images look like:
Hope this gives you a sneak peek into how neural networks work, especially with image data. If you wish to explore further, please have a look at this amazing blog post at distill.pub.
Let’s say you are working on predicting the future sales of a retail store. A neuron might get activated or give higher weights to certain inputs. For example: the variables item category and season of sale. For simplicity try to think of this as weights given in linear regression. Why would these two variables cause the needle to shift?
Well, maybe there are some seasonal products in the data set which activate that particular neuron. Similarly we can develop some intuition of which variables are influencing our neurons.
Let’s dive into the core of neural networks. Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work. Warning some math involved! Don’t worry, we will first try to explain it in an intuitive manner and then explore some math behind it.
The post Deep Learning Black box appeared first on Data Science Discovery.
The post Deep Learning Activation Function appeared first on Data Science Discovery.
Understand how an activation function behaves for different neurons and connect it to the grand architecture. The concept of different types of activation functions explored in detail.
We briefly discussed activation functions in the blog regarding the architecture of neural networks, but let’s improve our understanding by diving into this topic further.
Activation functions are essentially the deciding authority, on whether the information provided by a neuron is relevant or can be ignored. Drawing a parallel to our brain, there are many neurons but all the neurons are not activated by an input stimuli. Thus, there must be some mechanism, that decides which neuron is being triggered by a particular stimuli. Let’s put this in perspective:
The output signal is attained only if the neuron is activated. Consider a neuron A that provides the weighted sum of its inputs along with a bias term.
Thus, we are simply doing some linear matrix transformations and as mentioned in the deep learning architecture blog, just doing a linear operation is not strong enough. We need to add some Non-Linear Transformations, that is where Activation functions come into the picture.
Also, the range of this function is (-inf, inf). When we get an output from the neural network, this range does not make sense. For example, if we are classifying images as dogs or cats, what we need is a binary value or a probability, so we need a mapping to a smaller space. The output space could be 0 to 1, or -1 to 1, and so on, depending on the choice of function.
So to summarize we need the activation functions to introduce non-linearities, get better predictions and reduce the output space.
Now, let's do a simple exercise: given this idea of neuron activation, how would you come up with an activation function? What we want is a binary value indicating whether a neuron is activated or not.
The first thing that comes to mind is defining a threshold: if the value is beyond a certain threshold, declare the neuron activated. If we are defining this function on the space 0 to 1, we can easily say that for any value above 0.5 the neuron is considered activated.
Wow! We have our first activation function. What we have defined here is a step function, also known as the unit or binary step function.
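As a minimal sketch (the function name and the 0.5 threshold are just the illustrative choices from above), the binary step function can be written as:

```python
def step(z, threshold=0.5):
    """Binary step activation: fire (1) if the input crosses the threshold."""
    return 1 if z >= threshold else 0

print(step(0.7))  # 1 -- activated
print(step(0.2))  # 0 -- not activated
```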
Advantage
Disadvantage
Thus, we want the activation function to be differentiable because of how back-propagation works.
Now let's take a look at the larger picture. There are multiple neurons in our neural network. We discussed in the blog on the intuitive understanding of neural networks how neural networks look for patterns in images. If you haven't read it, all you need to know is that different neurons might select or identify different patterns. Revisiting the dog vs. cat image identification example: if multiple neurons are activated, what will happen?
With the use case defined above, let's try a linear function, since we have figured out that a binary function didn't help much: f(X) = CX, a straight-line function where the activation is proportional to the weighted sum from the neuron. If more than one neuron is activated, we can take the max of the neuron activation values; that way we have only one neuron to be concerned about.
Oh wait! The derivative of this function is a constant: f'(X) = C. What does that mean?
Well, this means that every time we do back-propagation, the gradient would be the same, and there is no improvement in the error. Also, with each layer applying a linear transformation, the final output is just a linear transformation of the input. Further, an output space of (-inf, inf) is difficult to work with. Hence, not desirable.
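To see why stacking linear activations buys nothing, here is a small check (the weights and biases are arbitrary illustrative numbers): two linear layers compose into a single linear map.

```python
# Two 1-D "layers" with linear activation collapse into one linear map.
w1, b1 = 2.0, 1.0   # first layer:  x -> w1*x + b1
w2, b2 = 3.0, -0.5  # second layer: h -> w2*h + b2

def two_layers(x):
    return w2 * (w1 * x + b1) + b2

# Equivalent single linear layer: w = w2*w1, b = w2*b1 + b2
w, b = w2 * w1, w2 * b1 + b2

for x in [-1.0, 0.0, 2.5]:
    assert abs(two_layers(x) - (w * x + b)) < 1e-12  # identical outputs
```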
Let's pull out the big guns: the sigmoid function, a smoother version of the step function. It is non-linear and can be used for non-binary activations. It is also continuously differentiable.
Most values lie between -2 and 2. Further, even small changes in the value of Z result in large changes in the value of the function. This pushes values towards the extreme ends of the curve, making clear distinctions in prediction. Another advantage of the sigmoid activation is that its output lies in the range 0 to 1, making it an ideal function for use cases where a probability is required.
That sounds all good! Then what's the issue? Beyond +3 and -3 the curve gets pretty flat. This means that the gradient at such points will be very small; the improvement in error becomes almost zero and the network learns slowly. This is known as the vanishing gradient problem. There are some ways to take care of this issue. Other issues are the computational load and the output not being zero-centered.
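The flattening is easy to see numerically. A small sketch (function names are ours) using the identity that the sigmoid's derivative is s(z) * (1 - s(z)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

# Near 0 the gradient is at its largest; past +/-3 it is close to zero.
print(sigmoid_grad(0.0))            # 0.25
print(round(sigmoid_grad(5.0), 4))  # 0.0066 -- vanishing
```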
Hyperbolic tangent activation function is very similar to the sigmoid function.
Compare the formula of the tanh function with the sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
To put it in words, if we scale and shift the sigmoid function we get the tanh function. Thus, it has properties similar to the sigmoid. The tanh function also suffers from the vanishing gradient problem and therefore kills gradients when saturated. Unlike the sigmoid, tanh outputs are zero-centered since its range is between -1 and 1.
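The identity above is easy to verify numerically (the test points are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# tanh(x) == 2 * sigmoid(2x) - 1 for every x
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-9
```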
To truly address the problem of vanishing gradients we need to talk about the rectified linear unit (ReLU) and leaky ReLU. ReLU is one of the most popular activation functions for the hidden layers of deep neural networks.
g(z) = max{0, z}
When the input z < 0 the output is 0, and when z > 0 the output is z. A derivative exists for all values except 0: the left derivative is 0 while the right derivative is 1. That raises a new issue: how will it work with gradient descent? In practice, an input of exactly 0 is rare (true values close to zero tend to be rounded), so this seldom matters; software implementations of neural network training usually return one of the one-sided derivatives instead of raising an error. ReLU is computationally very efficient, but it is not a zero-centered function. Another issue is that if z < 0 during the forward pass, the neuron remains inactive and kills the gradient during the backward pass. The weights then do not get updated, and the network does not learn (the dying ReLU problem).
Leaky ReLU is a modification of the ReLU activation function: when x < 0, it has a small positive slope (0.1 here; 0.01 is also a common choice). This eliminates the dying ReLU problem, though the results achieved with it are not always consistent. It retains the characteristics of ReLU: computationally efficient, converges much faster, and does not saturate in the positive region.
f(x) = max(0.1*x,x)
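A quick sketch of both functions (names are ours; the 0.1 slope follows the formula above):

```python
def relu(z):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)

def leaky_relu(z, slope=0.1):
    """Leaky ReLU: a small slope for negative inputs keeps the gradient alive."""
    return z if z > 0 else slope * z

print(relu(-3.0), relu(2.0))          # 0.0 2.0 -- negative inputs are zeroed
print(round(leaky_relu(-3.0), 2))     # -0.3 -- a small signal survives
```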
There are many different generalizations and variations to ReLU such as parameterized ReLU.
To tie things together, let's discuss one last function, often used in the final layer of neural networks: softmax. The softmax function is a generalization of the sigmoid that is often used for multi-class classification problems.
Look at the numerator: since we take an exponential of Zj, the result is always positive. Further, even small changes in Zj produce large changes in the output (exponential scale). The denominator is the sum of exp(Zj) over all classes, so the resulting probabilities add up to 1.
This makes it perfect for classifying among multiple mutually exclusive classes, for example assigning a handwritten digit to one of the classes 0-9. Note that for multi-label problems, such as satellite images of a landscape where water, rain forests and land can all appear in the same image, independent sigmoid outputs per label are the better fit, since the labels need not sum to 1.
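A small softmax sketch (the max-shift is a standard numerical-stability trick, not something the formula above requires):

```python
import math

def softmax(zs):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])    # [0.659, 0.242, 0.099]
assert abs(sum(probs) - 1.0) < 1e-12   # probabilities sum to 1
```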
Many activation functions and their characteristics have been discussed here. But the final question is when to use what?
There is no exact rule for the choice; rather, it is determined by the nature of your problem. Keep the characteristics of the activation functions in mind and choose the one that suits your use case and provides faster convergence. The most common practice is to use ReLU for the hidden layers and sigmoid (binary classification, e.g., cat vs. dog) or softmax (multi-class classification) in the final layer.
The post Deep Learning Origins appeared first on Data Science Discovery.
Every superhero has an origin story! We will take a look at the overall motive and the inspiration behind (deep learning) neural networks to understand what they are.
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
Have you ever wondered how our brain works? How it is easily able to detect and recognize objects?
In 1943, Walter Pitts and Warren McCulloch created a computer model based on the neural networks of the human brain. This is where it all began: the first model, the birth of neural network algorithms.
Over time, there have been many developments, but the purpose of this blog is not a history lesson. What matters is their intention: they were trying to develop a model based on how the neural networks of the human brain work. That is great, but how does our brain work?
Neurons are the driving force behind every memory and action.
The dendrites are the receivers of the signal and the Axon is the transmitter of the signal. Imagine several neurons where signal from the dendrite of one neuron is passed through the axon on to the dendrites of the next neuron. This is how these signals/electrical impulses are transferred from one neuron to another.
To take an example, we can say that the inputs are the senses such as smell, taste and so on. The brain makes sense of all these inputs. Similarly, neural networks have an input layer, and based on some transformations we get the output signal/prediction. The neuron is where all the action happens; that is, some mathematical transformation takes place there. We will discuss these transformations in more detail in later posts.
As there are so many neurons in our brain, does each and every one fire for each input? We know different parts of our brain are used for different types of sensory inputs. That is where the activation function in neural networks comes in: it is a way to decide whether a neuron is to be fired or not. For explanation purposes, in the image above we have an input layer followed by a single neuron. In reality, there are many neurons and many layers involved. A neuron receives either the input data or the resultant values of other neurons (perceptrons).
Linear regression bears some resemblance to the initial state of neural networks. Frank Rosenblatt’s Perceptron was the first idea conceived specifically as a method to make machines learn.
It took binary inputs and multiplied them with continuous-valued weights. The sum of the weighted values (plus a bias) is thresholded to output 0 or 1. This weighted sum + bias is very similar to linear regression. In the image shared below, the xi's are the input values and the wi's are the respective weights.
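A minimal sketch of that forward pass (the AND-gate weights below are hand-picked for illustration, not learned):

```python
def perceptron(inputs, weights, bias):
    """Rosenblatt-style perceptron: weighted sum plus bias, thresholded at 0."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if z >= 0 else 0

# A perceptron computing logical AND with hand-chosen weights and bias.
def and_gate(a, b):
    return perceptron([a, b], weights=[1.0, 1.0], bias=-1.5)

print(and_gate(1, 1), and_gate(1, 0))  # 1 0
```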
We will discuss this in more detail in the next part of this series, the architecture of neural networks.
The key takeaways here:
The post Deep Learning – Introduction appeared first on Data Science Discovery.
What is this mythical beast I keep hearing about? Today, deep learning is a buzzword for a well-deserved reason. Let's do a deep dive into this subject and slay this beast. Some of its applications include:
We must all have come across at least one application on our phones that uses deep learning. We also keep hearing about self-driving cars; it reminds me of Back to the Future, I, Robot and so on. There are so many movies where self-driving cars were imagined, and now we finally have them as a reality. To further understand what made all this possible and how it works, let's tackle this topic layer by layer:
This blog series has been put together by using several references and it is necessary to point out some of them so that other readers can also take inspiration and understanding from these sources.
I will continue to update this list of references and add more articles as we progress on our deep learning exploration.
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
The post UMAP appeared first on Data Science Discovery.
Before we discuss this new and exciting development in dimension reduction techniques (UMAP), we need to know the reasons to use dimension reduction and the different techniques available in the data science arsenal. Please have a look at them here.
To give you an idea of UMAP in action, consider the MNIST dataset, where we are classifying images of the digits 0-9. Note how clusters (for explanation purposes, let's call these concentrations of points clusters) are formed in this image. UMAP tries to preserve the local structure (within each cluster) while ensuring some separation between the clusters.
Firstly, I have to credit Leland McInnes and John Healy for developing this novel manifold learning technique for dimension reduction; their paper can be referenced here. This technique can be used for visualization, similar to t-SNE, but also for general non-linear dimension reduction. t-SNE works very well, but it has limitations such as loss of large-scale information and being computationally intensive.
Let’s start with some basics and slowly develop this concept and along the way develop understanding around UMAP. We can say that there are two main steps we want to accomplish here:
An n-dimensional manifold is a space that locally looks like n-dimensional Euclidean space but may have a different global structure. To put it more plainly, consider a sheet of paper: at any point on the sheet it is easy to draw a small straight line. Now curve the paper into a sphere-like surface; on that surface, the same line is bent.
Here, the straight line represents a Euclidean space: around any point on the sheet, the surface locally looks like a flat n-dimensional Euclidean space. The sphere is the global structure, which is not Euclidean.
Let's get acquainted with some ideas of topology. Topology may sound scary, but it is simply about the location of a data point relative to the other points of a set. In general, a space is just a set of points; if we associate a notion of distance between points, we call it a metric space. Topological spaces are generalizations of metric spaces.
Further, a cover Y of a topological space X is essentially a space that contains all points of X along with some surrounding points in each point's neighborhood. If that sounds confusing, consider this example: imagine you are looking from above at a man holding an umbrella. If the man is the data point, then the umbrella is the respective cover; it covers the man and some surrounding area as well.
In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. A zero-simplex is simply a vertex, a one-simplex is an edge, a two-simplex is a triangle, and so on up to a k-simplex built on k+1 points. Thus, we combine data points to form these structures as shown below:
In algebraic topology, simplices are used as building blocks to construct an interesting class of topological spaces called simplicial complexes. These spaces are built from simplices glued together in a combinatorial fashion.
Consider that we have a data set represented in the graph below. We have placed a cover around each data point representing its neighborhood.
This data can be represented in the form of Simplicial Complexes that is by using 0-simplices and 1-simplices such as:
A neighborhood-graph-based approach should capture the manifold structure when doing dimension reduction. The idea in UMAP is to find a topological representation of the data in a lower-dimensional space. It is like an artist trying to paint a portrait: the artist wants a true representation of the subject's features while converting a 3D subject into a 2D representation.
The algorithm is founded on three assumptions about the data
In practical applications, when is data ever that clean? One question that comes to mind: are we losing any information by forming this representation?
Further, what should the size of the neighborhood (cover) of a data point be so that we do not lose the topological information? Another point evident in the image above: what if the data points are concentrated in certain locations while other locations have no points at all? Also, within a concentration of points, which data points are to be connected when building simplicial complexes?
Too many questions! Let's try to answer some of them.
In an ideal world with uniformly distributed data, the distance between two points would be a perfect radius for the cover. But here we can consider varying distances based on each data point and its respective neighborhood.
The nerve theorem, simply put, states that the simplicial complex defined from a cover is equivalent (homotopically) to the union of the cover; thus the topological information is not lost.
A Riemannian metric defined on this manifold essentially means we are trying to make sense of the angles and distances between data points. Defining a Riemannian metric is necessary because we are trying to preserve the topology here.
To build simplicial complexes we need to consider the k nearest neighbors of a data point, which can be connected via edges. But what would be an appropriate k?
A small k implies attention to the fine structure of the topology while a large k implies that we want to estimate based on larger regions at the cost of the finer structure. The right k depends on the distribution of distances in the data set.
Now, if a data point has 5 neighbors, instead of having a binary answer (yes or no) to whether they are connected, we can have a fuzzy answer (a value between 0 and 1). Consider weights for each connecting edge based on the distance; these weights will help us form the simplicial complexes.
Interpret the weights as the probability of the simplex existing
The assumption of local connectivity comes into the picture: for any point on the manifold, there is some sufficiently small neighborhood of that point that is connected. Thus, there should be no isolated points. Of course, if the data set has points that are very widely spread, our confidence in the estimates we make while connecting them will be low. But the assumption of local connectivity is necessary, as it also helps tackle the curse of dimensionality.
The distance to the first nearest neighbor can be quite large, but the distance to the tenth nearest neighbor can often be only slightly larger (in relative terms).
The local connectivity constraint ensures that we focus on the difference in distances among nearest neighbors rather than the absolute distance (which shows little differentiation among neighbors).
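A simplified sketch of how such fuzzy edge weights could be computed (the exponential form follows the UMAP paper; here `rho` is the distance to the nearest neighbor, and `sigma` is fixed at 1.0 for illustration, whereas the real algorithm solves for a per-point value):

```python
import math

def membership(dist, rho, sigma):
    """Fuzzy edge weight: the nearest neighbor (dist == rho) gets weight 1,
    and farther neighbors decay exponentially relative to the local scale."""
    return math.exp(-max(0.0, dist - rho) / sigma)

dists = [0.4, 0.9, 1.5, 3.0]   # distances from one point to its neighbors
rho = min(dists)               # local connectivity: the closest neighbor
weights = [membership(d, rho, sigma=1.0) for d in dists]
print([round(w, 3) for w in weights])  # first weight is 1.0, the rest decay
```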
As these data points do not lie on a flat 2D graph, two data points can have conflicting distances to each other; that is, there can be multiple edges connecting the same pair of points. From point a's perspective the distance from a to b might be 1.5, but from point b's perspective the distance from b to a might only be 0.6.
To resolve such cases, we combine the weights as follows: a + b - a*b
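This combination rule is the fuzzy-set union of the two directed edge weights; a tiny sketch (the 0.8/0.3 values are illustrative):

```python
def combine(w_ab, w_ba):
    """Fuzzy union of the two directed edge weights: a + b - a*b."""
    return w_ab + w_ba - w_ab * w_ba

# Point a sees b as close (0.8); b sees a as farther away (0.3).
print(round(combine(0.8, 0.3), 2))  # 0.86 -- a single symmetric weight
```

Note that if either directed weight is 1 the combined weight is 1, and if one is 0 the other passes through unchanged, which is exactly the behavior one wants from a union of probabilities.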
So far we have only found a way to represent the data without losing much of the topological structure. We still need to embed this information into a low-dimensional space.
Consider:
The numerator in the log function here is the weight function in the high-dimensional space, while the denominator is the weight function in the low-dimensional space. This part provides an attractive force between the points; in other words, we want to preserve the structure of the data, especially where points are clumped together. This term is minimized when the denominator is as large as possible, which occurs when the distance between the points is as small as possible.
This second term acts like a repulsive force, as it is minimized by making the weight function of the low-dimensional space as small as possible. In other words, we want to get the gaps, or separation, in the global structure right.
The right balance between these two components is what we are looking for.
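A sketch of this objective for a single edge (the function names and `eps` smoothing are ours; the real optimizer minimizes the sum of these terms over all edges by stochastic gradient descent):

```python
import math

def cross_entropy(w_high, w_low, eps=1e-9):
    """Fuzzy cross-entropy for one edge: the first term attracts (pulls
    w_low up toward w_high), the second repels (pushes it back down)."""
    attract = w_high * math.log((w_high + eps) / (w_low + eps))
    repel = (1 - w_high) * math.log((1 - w_high + eps) / (1 - w_low + eps))
    return attract + repel

# An edge that is strong upstairs (0.9) is penalized if it is weak downstairs.
print(round(cross_entropy(0.9, 0.1), 3))  # large penalty
print(cross_entropy(0.9, 0.9))            # 0.0 when the weights agree
```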
To get there faster, that is, to find the low-dimensional topological space that closely represents our high-dimensional data, we can use:
In summary, UMAP might look very complex, but it is in fact quite an interesting technique. In recent months UMAP has gained a lot of buzz, and for good reason: it gives good results much faster than t-SNE.
Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.
About the writers:
Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.