Data Science Discovery https://www.datasciencediscovery.com Blog on the Path to your DS Journey Mon, 13 Apr 2020 18:15:36 +0000 en-US hourly 1 https://wordpress.org/?v=5.4.2 165621578 Generative Adversarial Networks Deep Dive https://www.datasciencediscovery.com/index.php/2020/03/16/generative-adversarial-networks-deep-dive/?utm_source=rss&utm_medium=rss&utm_campaign=generative-adversarial-networks-deep-dive https://www.datasciencediscovery.com/index.php/2020/03/16/generative-adversarial-networks-deep-dive/#respond Mon, 16 Mar 2020 17:45:24 +0000 http://www.datasciencediscovery.com/?p=720 If you are wondering what is so interesting about generative adversarial networks (GAN), please refer to the following link. In this article we dive further into the depths of GAN’s and understand how this technique works. There have been a lot of improvements in generative adversarial networks (GAN’s) over time, but let’s go to the […]]]>

If you are wondering what is so interesting about generative adversarial networks (GANs), please refer to the following link. In this article we dive deeper into GANs to understand how the technique works.

There have been a lot of improvements in GANs over time, but let’s go back to the origin of it all in order to understand the concept.


Intuition

Have you heard about the art forger Mark Landis? 

It is quite an interesting story. He had been responsible for submitting forgeries to several art museums and got even better at making them over time. He often donated his counterfeits to these museums with doctored documents and even dressed as a priest to avoid suspicion. Leininger (curatorial department) was the first person to pursue Landis. You can read more about it here. But for the purpose of explaining this concept we need limited knowledge of this event.

Imagine that you are Leininger, in charge of deciding whether a presented painting is fake or authentic, and that Landis is making his first forgery.

At first, you find it easy to identify a fake. However, over time both Landis and you get better. Landis develops more sophisticated skills, making it increasingly difficult for you to spot fakes.

How does it work?

To connect with the example in the previous section, consider the generator as Landis and the discriminator as Leininger. Here, however, the discriminator and the generator are two different neural networks, each trying to reduce its own error.

The generator tries to produce output that fools the discriminator, while the discriminator tries to tell real data from fake data. In other words, GANs are inspired by a zero-sum non-cooperative game in which the generator tries to maximize the number of times it fools the discriminator while the discriminator tries to minimize it.

These networks use back-propagation to reduce the error.

In other words, the generator and discriminator are two adversaries or opponents playing a game. They go back and forth against each other, improving their skill over time.

Discriminator

  • Classifies both real and fake data
  • Updates weights based on the discriminator loss

Generator

  • Takes random noise as input. The noise can be sampled from any distribution; we usually choose one that is easy to sample from and has fewer dimensions than the output.
  • Transforms the random input into a more suitable form.

Loss Function

The loss function consists of two parts:

> Generator's Loss + Discriminator's Loss
> Loss while identifying real data points + Loss from generated / fake data.

$ \min_G \max_D V(D, G)= \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] $

Looks Daunting! Let’s break it down.

Let’s start with the Generator’s loss.

Let D be the discriminator and G the generator.

To start with, consider G(z), which is the output of the generator neural network for the noise input z. Note that we have randomly sampled the noise from a probability distribution. The generator based on the weights it has learned is able to transform the noise into hopefully something more meaningful.

D( G(z) ) is the discriminator using the output of the generator as an input.

In other words, at this point we are trying to find out the probability that a fake instance (forgery) is real.

$ \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] $

The above mentioned part can be summarized in the following points:

  1. D(G(z)) is the probability that the discriminator classifies the generator’s transformed output as real.
  2. Further, log(1 - D(G(z))) is not equal to -log(D(G(z))), but both decrease as D(G(z)) increases, so minimizing either one pushes the generator in the same direction.
  3. We take the expectation of the values from the point above over the noise distribution. At the start of training, the generator produces output that is far from the ground truth, so the discriminator finds it easy to identify the fake data points.
  4. In other words, the generator minimizes its loss E[log(1 - D(G(z)))], or alternatively maximizes E[log(D(G(z)))]. As the generator gets better, the discriminator has to get stronger too: if the generator creates images closer to the ground truth, the discriminator starts looking for more complex patterns to distinguish an authentic image from a fake one.

The second part of the loss function is rather simple.

$ \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] $

  • The input to the discriminator here is the real data.
  • We are trying to maximize this part, that is, we want the discriminator to get better at recognizing the real images.
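To make the two expectations concrete, here is a minimal sketch in plain Python that estimates V(D, G) from a mini-batch of discriminator outputs. The function names and the sample probabilities are invented for illustration; in a real GAN these probabilities would come from the discriminator network.

```python
import math

def value_function(d_real, d_fake):
    """Mini-batch estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_losses(d_fake):
    """Return (log(1 - D(G(z))) loss, -log(D(G(z))) loss); both are minimized."""
    saturating = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    non_saturating = -sum(math.log(p) for p in d_fake) / len(d_fake)
    return saturating, non_saturating

# Hypothetical outputs early in training: fakes are easy to spot.
early = value_function(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# Hypothetical outputs later: the generator fools the discriminator more often.
later = value_function(d_real=[0.6, 0.5], d_fake=[0.5, 0.4])
print(early, later)
```

The discriminator wants this value to be high and the generator wants it to be low, which is why the value drops in the second call as the fakes become harder to spot.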

Training Process

  • The training process consists of simultaneous SGD.
  • On each step, two mini-batches are sampled: one from the real data and the other from generated / fake data.
  • Two gradient steps are made simultaneously, one optimizing the error of the generator and one that of the discriminator.
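The steps above can be sketched with a toy one-dimensional GAN. This is only an illustration under strong assumptions: the discriminator is a single logistic unit D(x) = sigmoid(w*x + b), the generator just shifts the noise, G(z) = z + c, and the gradients are derived by hand for these two hypothetical models. Real GANs use neural networks and automatic differentiation.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def simultaneous_step(w, b, c, real_batch, noise_batch, lr=0.1):
    """One simultaneous gradient step for a toy GAN.

    Both updates are computed from the same (old) parameters:
    the discriminator ascends V, the generator descends E[log(1 - D(G(z)))].
    """
    fake_batch = [z + c for z in noise_batch]
    d_real = [sigmoid(w * x + b) for x in real_batch]
    d_fake = [sigmoid(w * g + b) for g in fake_batch]
    n_r, n_f = len(real_batch), len(fake_batch)

    # Gradient of V with respect to the discriminator parameters.
    grad_w = (sum((1.0 - d) * x for d, x in zip(d_real, real_batch)) / n_r
              - sum(d * g for d, g in zip(d_fake, fake_batch)) / n_f)
    grad_b = sum(1.0 - d for d in d_real) / n_r - sum(d_fake) / n_f

    # Gradient of E[log(1 - D(G(z)))] with respect to the generator shift c.
    grad_c = -sum(d * w for d in d_fake) / n_f

    return w + lr * grad_w, b + lr * grad_b, c - lr * grad_c

# Usage: real data centred at 3.0, noise centred at 0.0 (both made up).
rng = random.Random(0)
w, b, c = 0.1, 0.0, 0.0
for _ in range(200):
    real = [rng.gauss(3.0, 1.0) for _ in range(16)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
    w, b, c = simultaneous_step(w, b, c, real, noise)
```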

In this article we have not explored certain concepts in too much detail such as:

  • The different choices available for the loss function
  • Finding the optimal Discriminator
  • Divergence measures. In simple terms, a divergence is a measure of the distance between two probability distributions. In our case, we want the distribution of the generated (fake) data to be close to the distribution of the real data.
  • Label smoothing, Batch Normalization and other tips


About Us

Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.

About the writers:

Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.

]]>
Generative Adversarial Networks https://www.datasciencediscovery.com/index.php/2020/03/16/generative-adversarial-networks/?utm_source=rss&utm_medium=rss&utm_campaign=generative-adversarial-networks https://www.datasciencediscovery.com/index.php/2020/03/16/generative-adversarial-networks/#respond Mon, 16 Mar 2020 17:32:08 +0000 http://www.datasciencediscovery.com/?p=710 The buzz around Deep-fakes has reached far and wide, further it has been a candidate of conversation for several months. Let’s understand what the buzz is all about. Let’s look into the greatest hits and the most impressive applications by GANs (Generative Adversarial Networks), before we take a deep dive into the depths of the […]]]>

The buzz around deepfakes has reached far and wide, and it has been a topic of conversation for several months. Let’s understand what the buzz is all about.


Let’s look at the greatest hits and the most impressive applications of GANs (Generative Adversarial Networks) before we take a deep dive into the algorithm.

Impressive Applications

  • Single Image Super-Resolution (SRGAN): One example of an industrial application is producing high resolution magnetic resonance (MR) images faster and cutting wait times.
  • Perceptual GAN (PGAN): Can we find a way to identify objects in low resolution, noisy images? PGANs might be able to help you out.
  • Text to Photo-realistic Image Synthesis (StackGAN): Have you ever thought of creating an image out of thin air? This implementation of GANs takes a stab at such a challenging problem.
  • High Resolution Image Synthesis: Self-driving cars need a lot of training data to learn how to drive safely. You have to start somewhere, and this is one of the techniques that help generate videos for training.

This type of image synthesis is a form of conditional GAN, a family that is known for several applications.

Inspiration

The list above is just a preview of some of the applications of generative adversarial networks (GANs). We at Data Science Discovery also felt inspired and started on our journey to dive into the depths of this topic.

I looked at several whitepapers to get familiar with this topic. Further, one of my fellows, Navin Manaswi, who at the time was working on a new book “Generative Adversarial Networks with Industrial Use Cases” helped out by sharing some of the chapters he had written.

My colleague and I also decided to experiment with this technique on one of our ongoing projects. Most examples we have seen show that it works well on images, but what about structured data? That, however, is a whole other story.

Deep Dive

In the next article we dive deeper into the concept by building an intuition and learning about the architecture of these neural networks.

About the writers:

  • Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.
]]>
NLP with Deep Learning https://www.datasciencediscovery.com/index.php/2019/03/11/nlp-with-dl/?utm_source=rss&utm_medium=rss&utm_campaign=nlp-with-dl https://www.datasciencediscovery.com/index.php/2019/03/11/nlp-with-dl/#respond Mon, 11 Mar 2019 12:01:17 +0000 http://datasciencediscovery.com/?p=422 Text Classification Our focus is to solve text classification problem using deep learning. To reiterate the problem of NLP (Natural Language Processing) based text classification below: Problem In today’s world, websites have to deal with toxic and divisive content. Especially major websites like Quora which cater to large traffic and their purpose is to provide […]]]>

Text Classification

Our focus is to solve a text classification problem using deep learning. To reiterate, the NLP (Natural Language Processing) based text classification problem is described below:

Problem

In today’s world, websites have to deal with toxic and divisive content. This is especially true for major websites like Quora, which cater to large traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

A question is classified as insincere if it:

  • Has a non-neutral tone directed at someone
  • Is discriminatory or contains abusive language
  • Contains false information

For more information regarding the challenge you can use the following link.

Code

The full deep learning code used is available here.

Deep Learning Approach

In part 1, we saw how machine learning algorithms could be applied to text classification. We had to identify and create a variety of features to reduce the complexity in the data. This required considerable effort but was essential for the learning algorithms to be able to detect patterns in the data.
In this section, we will approach the same problem using deep learning techniques. Deep learning has become the state of the art method for a variety of classification tasks. But the first thing we need to understand is the motivation for using it, which is outlined here:

Advantages:
  • Saves Effort: No need to spend time exploring the intricacies of the text. We can start with a plug and play neural network and then evaluate whether any pre-processing or data cleaning is necessary. During our experimentation we found that there wasn’t much incremental value obtained from cleaning text.
  • Feature Extraction: Deep learning tries to learn high level features from data in an incremental manner. This removes the need to perform extensive feature engineering.
  • Superior Performance: Given sufficient data, deep learning often outperforms other techniques.

Issues:

That all sounds great, but what is the difficult part of this task? The major problem is defining the right architecture. To get started, our primary task should be to understand the type of neural network we wish to use and, further, its architecture. In other words, we need to make a choice based on the following parameters:

  • Type of Neural Network: Should we use an RNN, a CNN or some hybrid of both? As we are dealing with textual data, we need a sequence model. In other words, our model should be able to remember the past words used in a sentence in order to draw some value from the context of the sentence. This is where an RNN shines.
  • Size, Width & Depth: We need to decide on the total number of nodes in the model, the number of layers and the number of neurons in the respective layers.
  • Elements of a Neuron: That is the activation function, scaling, limiting and so on.
  • Components/Layers: The decision of all the components or layers involved such as the embedding layer, using dropout or max-pooling and so on. We also need to decide how these layers will be ordered and connected.

Let’s understand some key concepts before we proceed further:

Word embeddings

Words represented as real valued vectors are what we call word embeddings. The value associated with each word is either learned by using a neural network on a large dataset with a predefined task (like document classification) or by using an unsupervised process such as using document statistics. These weights can be used as a part of transfer learning to take advantage of the reduced training time and better performance.
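As a minimal sketch of what an embedding lookup does, the snippet below maps tokens to vectors using a plain dictionary. The tiny 3-dimensional vectors are invented for illustration, and mapping unknown words to a zero vector is an assumption of this sketch; pre-trained embeddings typically have hundreds of dimensions.

```python
def embed(tokens, embeddings, dim):
    """Map each token to its vector; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [embeddings.get(tok, zero) for tok in tokens]

# Toy vectors, made up for this example only.
toy_embeddings = {
    "rain": [0.9, 0.1, 0.0],
    "umbrella": [0.8, 0.2, 0.1],
}

vectors = embed(["rain", "umbrella", "quora"], toy_embeddings, dim=3)
print(vectors)
```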

Recurrent neural networks

Consider how we make sense of a sentence: we look not only at a word but also at how it fits with the preceding words. Recurrent neural networks (RNN) take into consideration both the current inputs and the preceding inputs. They are suitable for sequence problems because their internal memory stores important details about the inputs they received in the past, which helps them predict the output of the next time step.
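The recurrence described above can be sketched with a scalar vanilla RNN. The weights here are arbitrary numbers chosen for illustration, not learned values, and a real RNN works on vectors rather than single numbers.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same values, different order: the final state differs,
# because the state carries a (fading) memory of earlier inputs.
early_signal = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
late_signal = rnn_forward([0.0, 0.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)
print(early_signal[-1], late_signal[-1])
```

Note how the influence of the first input shrinks with every step. That fading is the intuition behind the long term dependency problem.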

GRU and LSTM are improved variations of the vanilla RNN, which tackle the problem of vanishing gradients and handling of long term dependencies.

Consider a simple use case that we are trying to infer the weather based on the conversations between two people. Let’s take the following text as an example

“We were walking on the road when it started to pour, luckily my friend was carrying an umbrella.”

The target variable here has a classification of “rain”. In such a case the RNN has to keep “pour” in memory while considering “umbrella” to correctly predict rain, as the occurrence of “umbrella” alone is not definitive proof of rain. Similarly, the occurrence of “didn’t” before “pour” would have changed everything.

LSTM:

  • Long Short Term Memory (LSTM) networks improve on the vanilla RNN: they are able to capture long term dependencies by introducing input, forget and output gates, which control what previous information is stored and updated. In other words, if we need to keep word 11 in memory along with word 50 of a conversation to truly derive the context, an LSTM is better equipped to do so. LSTM is based on the following three major ideas:
    • Input gate: introducing a new word so that the network can learn from it.
    • Forget gate: does a previously learned word continue to make sense, or should the network forget about it?
    • Output gate: creating the final output, that is, the predicted value.
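The three gates can be sketched as a single scalar LSTM step. This is a simplified illustration: the parameter names in the dictionary `p` are hypothetical, and a real LSTM uses weight matrices and vector states.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds hypothetical weights and biases."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c

# With the forget gate wide open and the input gate shut,
# the cell state passes through almost unchanged.
names = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
params = dict.fromkeys(names, 0.0)
params["bf"] = 50.0
params["bi"] = -50.0
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```

This is how an LSTM can carry “word 11” all the way to “word 50”: as long as the forget gate stays close to 1, the stored value survives.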

GRU:

  • Gated Recurrent Units also deal with long term dependencies in a similar fashion to LSTM. However, they combine the input and forget gates into a single update gate, resulting in a simpler design compared to LSTM. The update gate essentially decides what past information to hold on to and the reset gate as the name suggests decides what should be discarded or forgotten.
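For comparison, here is a scalar GRU step under the same simplifying assumptions (hypothetical parameter names, scalars instead of vectors). Note that conventions differ on whether the update gate weights the old state or the new candidate; this sketch uses h = (1 - z) * h_prev + z * candidate.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # The reset gate decides how much of the old state feeds the candidate.
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * candidate

names = ["wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh"]
keep_params = dict.fromkeys(names, 0.0)
keep_params["bz"] = -50.0   # update gate nearly closed: hold on to the past
kept = gru_step(x=1.0, h_prev=0.3, p=keep_params)
```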
Bidirectional RNN

A bidirectional RNN first forward propagates left to right, starting with the initial time step. Then starting at the final time step, it moves right to left until it reaches the initial time step. This learning of representations by incorporating future time steps helps understand context better.

Let’s go back to our example of deciphering the weather from conversations. Reading from left to right, it makes sense to make the leap from “pour” to “umbrella”. Reading from right to left as well adds to the power of the model, because another conversation may have the words occurring in a different order, for example:

“I took out an umbrella as it started to pour.”
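A minimal sketch of the two passes, again with a scalar RNN and arbitrary weights. For brevity both directions share the same weights here, which is an assumption of this sketch; bidirectional layers normally learn separate weights for each direction.

```python
import math

def rnn_states(inputs, w_x, w_h):
    """Scalar RNN states for each time step, starting from a zero state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(inputs, w_x=1.0, w_h=0.5):
    """Pair each time step's left-to-right state with its right-to-left state."""
    forward = rnn_states(inputs, w_x, w_h)
    # Run over the reversed sequence, then flip back to align the time steps.
    backward = rnn_states(inputs[::-1], w_x, w_h)[::-1]
    return list(zip(forward, backward))

# The only signal is at the last step: at step 0 the forward state has not
# seen it yet, but the backward state already has.
states = bidirectional_states([0.0, 0.0, 1.0])
print(states[0])
```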

Attention Layer

I think the best way to describe attention is to look at a basic CNN use case: image classification (Dogs vs Cats). If you are given an image of a dog, what is the most defining characteristic that helps you tell it apart? Is it the dog’s nose or its ears? The attention mechanism effectively blurs certain pixels and focuses on only a portion of the image. Thus, it assigns weights that tell the model what to focus on.

In our context, the attention model takes into account the inputs from several time steps back and assigns a weight, signifying interest, to each of them. Attention is used to focus on specific words in the text sequence. For example, over a large dataset the attention layer will give more weight to words like “rain”, “pour”, “umbrella” and so on.
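The weighting idea can be sketched as a softmax over per-word scores followed by a weighted sum of the hidden states. In an actual attention layer the scores are produced by learned parameters; here they are simply passed in, and all numbers are made up.

```python
import math

def attention_pool(hidden_states, scores):
    """Turn scores into softmax weights and return (weights, weighted sum)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Three time steps with 2-dimensional hidden states (hypothetical values).
# The second step gets the highest score, so it dominates the summary.
hidden = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
weights, context = attention_pool(hidden, scores=[0.0, 2.0, 0.0])
print(weights, context)
```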

Stochastic weight averaging

We used stochastic weight averaging (SWA) to update the weights of our network for the following reasons:

  • SWA can be applied to any architecture and dataset, and shows good results. It is essentially an ensemble technique: from a chosen epoch onwards, you store the weights at subsequent epochs and average them.
  • But wouldn’t that add to the computational load? SWA only requires the weights of one pre-trained model to initialize; after that, we keep a running average of the weights and update it at each epoch.
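The running average mentioned above needs only one extra copy of the weights. A minimal sketch, with the weights flattened into a plain list and the snapshot values made up:

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into the running average.

    n_averaged is the number of snapshots already included in swa_weights.
    """
    return [
        (s * n_averaged + w) / (n_averaged + 1)
        for s, w in zip(swa_weights, new_weights)
    ]

# Hypothetical weight snapshots taken at the end of three epochs.
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
swa = snapshots[0]
for n, weights in enumerate(snapshots[1:], start=1):
    swa = swa_update(swa, weights, n)
print(swa)
```

The result equals the plain average of the three snapshots, without ever keeping all of them in memory.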

The image below (Illustrations of SWA and SGD with a Pre-activation ResNet-164 on CIFAR-100) shows how well SWA generalizes and results in better performance. On the left we have W1, W2 and W3 as weights of three independently trained networks and Wswa is the average of the three. This holds even though on the right we see a greater train loss by SWA as compared to SGD.

Cyclical Learning Rate (CLR)

The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries, and the learning rate varies cyclically between these bounds based on a predefined function. During gradient descent we may get stuck at a local minimum, and a cyclical learning rate can help us jump out to a different location and move towards the global optimum.
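One common choice of predefined function is a triangular wave. The sketch below assumes that policy; the bounds and the step size are illustrative values, not the ones used in our model.

```python
import math

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate climbs from base_lr to max_lr over step_size iterations,
    then falls back to base_lr over the next step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle lasts 2 * step_size = 8 iterations.
schedule = [triangular_lr(i, base_lr=0.001, max_lr=0.003, step_size=4)
            for i in range(9)]
print(schedule)
```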

Dropout

This is a technique specifically used to prevent overfitting. It involves dropping some percentage of the units during the training process. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. The units are selected randomly. Zeroing out activations at random forces the network to build in redundancies, which acts as a regularizer: the network overfits less and generalizes better.

There are different ways to drop values. Think of the pixels in an image: each pixel is highly correlated with its neighbours, so randomly dropping individual pixels will not accomplish much. That is where techniques like spatial dropout come into the picture. Spatial dropout involves dropping entire feature maps. For explanation purposes, consider an image cropped into smaller segments, each being mapped using a function. Out of all these mapped values, we randomly delete some of them.
With textual data, you can think of it as dropping entire embedding dimensions across a sentence rather than scattered individual values, forcing the model to generalize better.
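The difference between the two variants can be sketched in a few lines. This sketch assumes "inverted" dropout, where the surviving values are scaled up by 1 / (1 - rate) so that the expected activation stays the same, and it assumes rate < 1. The function names are hypothetical.

```python
import random

def dropout(values, rate, rng):
    """Standard dropout: each value is dropped independently."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def spatial_dropout(sequence, rate, rng):
    """Spatial dropout for a sequence of time steps.

    sequence is a list of time steps, each a list of channel values.
    One mask entry is drawn per channel and reused at every time step,
    so a dropped channel is zero for the whole sequence.
    """
    keep = 1.0 - rate
    n_channels = len(sequence[0])
    mask = [1.0 / keep if rng.random() < keep else 0.0
            for _ in range(n_channels)]
    return [[v * m for v, m in zip(step, mask)] for step in sequence]

rng = random.Random(0)
plain = dropout([1.0] * 8, rate=0.5, rng=rng)
spatial = spatial_dropout([[1.0, 1.0, 1.0, 1.0]] * 5, rate=0.5, rng=rng)
```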

Batch normalization

Rather than having varying ranges in our data, we often normalize the data set to allow faster convergence. The same principle is used in neural networks for the inputs of the hidden layers. Batch normalization reduces internal covariate shift, that is, the change in the distribution of a layer’s inputs as the weights of earlier layers are updated. This speeds up training and, in practice, often helps the network generalize better.
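A minimal sketch of the normalization step for a single feature over one mini-batch. `gamma` and `beta` stand for the scale and shift that a real layer would learn; they are left at their neutral values here, and the running statistics used at inference time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero when the variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([10.0, 20.0, 30.0])
print(normalized)
```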

Pooling

Global average and global max pooling collapse the spatial (or time) dimension of each feature map into a single value, leaving one value per feature map.

Let’s say that we have an image of a dog. Global average pooling will take the average of all the activation values and tell us about the overall strength of the image, i.e. whether the image is of a dog or not.
Global max pooling, on the other hand, will take the maximum of the activation values. This will help identify the strongest trait of the image, say, ears of the dog. Similarly, in the case of textual data, global max pooling highlights the phrase with the most information while global average pooling indicates the overall value of the sentence.
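For a text sequence, both operations reduce a list of time steps to one value per feature. A minimal sketch with made-up activations:

```python
def global_max_pool(sequence):
    """Maximum of each feature over all time steps."""
    return [max(column) for column in zip(*sequence)]

def global_average_pool(sequence):
    """Average of each feature over all time steps."""
    return [sum(column) / len(column) for column in zip(*sequence)]

# Three time steps, two features per step (hypothetical activations).
activations = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
strongest = global_max_pool(activations)
overall = global_average_pool(activations)
print(strongest, overall)
```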

Model Architecture

Let’s dive deeper into the choices made and connect the dots between understanding the components of a neural network to actually forming one. Well, one of the most important factors that comes into play while deciding on the architecture is past experience and experimentation.

We first started with a very basic model (GRU) and plotted the accuracy and loss. We noticed during our experimentation that given the nature of our dataset it was very easy to overfit. Observe the graph below obtained on using a bidirectional GRU with 64 units/neurons and a single hidden layer (16 units):

Clearly, we start to overfit very quickly and we cannot keep the number of epochs high. It also suggests that a very complex model (one with multiple layers and several nodes) will be prone to overfitting, and underlines the importance of regularization techniques like dropout in the architecture.

We also carried out error analysis to further get an idea of the performance of the baseline model.

Error analysis on the baseline model

We intended to understand the following:

Which insincere questions does the network strongly believe to be sincere?

class 0 [0.94] Should we kick out Muslims from India?
class 0 [0.94] Why does Quora keep posting this leftist propaganda? 
               Are they owned by a liberal media conglomerate?
class 0 [0.94] Why don’t the Christian and Muslim quarter be under 
               Palestinian control and the Jewish quarter 
               be under Israeli control?

Our baseline model has an F1 score of 0.63, yet these sentences are being tagged as sincere. What is the issue?

After a deep dive into the dataset, we noticed that it contains several mislabelled cases. The competition details specify that the labels were produced by a combination of manual and ML/rule based efforts. This draws attention to a real world scenario that can exist in any ML/DL project. Ideally, we would relabel such cases, as the model is currently learning from mislabelled examples as well. However, that cannot be done given the constraints of the competition.

Which sincere questions does the network strongly believe to be insincere?

class 1 [0.00000001] What astrological combinations are needed to obtain 
                     a scholarship for studies?
class 1 [0.00000001] How many karats needed for a karat farm?
class 1 [0.0000001] How many SITA devices are present in Ramayana?

Which insincere/sincere questions is the network most uncertain how to classify?

Keep in mind our chosen threshold for converting probabilities into a binary is: 0.4

class 0 [0.40] Why do most people not consider the US a 
               two party system, even though any  
               party besides Democrat/Republican 
               realistically has no chance for a win?
class 0 [0.40] Feigning ignorance of who wrote it, 
               could you have mistaken 
               "Something" for Paul McCartney number 
               if he had sung it instead of George?

class 1 [0.399] Why can't we make radiation death rays with 
                huge spread and nearly infinite
                range to nullify Western air power once and for all and 
                put a stop their invasions of our sovereignty?
class 1 [0.399] What country is more tolerant to gays, 
                Saudi Arabia or USA?

Which insincere questions does the network strongly and correctly believe to be insincere?

class 1 [0.99] Why are the Quora moderators f*cking ***holes?
class 1 [0.98] Why is it that Quora is so f*cking liberal and stupid?

This gives us a general idea of what is happening and helps us develop some intuition of where the model shines and where it goes wrong. In particular, we also get to see some words that might be clearly tipping the classification scales in a particular direction.

Model Design

In the image shared above you get an idea of how our model was structured.

Let’s start with the input data: we have used an average of the pre-trained GloVe and Paragram word embeddings. The reason for this choice was that GloVe and Paragram were found to cover above 80% of our corpus. Further, taking an average of these embeddings gave better results on our baseline model.
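Averaging two embedding sets can be sketched as below. The toy vectors are invented, and the handling of words that appear in only one of the two sets (keeping the single available vector) is an assumption of this sketch rather than a description of our exact code.

```python
def average_embeddings(first, second):
    """Average two {word: vector} dictionaries word by word."""
    averaged = {}
    for word in set(first) | set(second):
        if word in first and word in second:
            averaged[word] = [
                (a + b) / 2 for a, b in zip(first[word], second[word])
            ]
        else:
            # Assumption: fall back to whichever vector is available.
            source = first if word in first else second
            averaged[word] = list(source[word])
    return averaged

# Hypothetical 2-dimensional stand-ins for the two pre-trained sets.
toy_glove = {"rain": [1.0, 3.0], "pour": [2.0, 2.0]}
toy_paragram = {"rain": [3.0, 5.0]}
combined = average_embeddings(toy_glove, toy_paragram)
print(combined)
```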
Spatial dropout is used immediately after the embeddings. This makes sure our model is more robust during training and helps prevent overfitting.
Following this, we have used a bidirectional LSTM layer with 40 units. We decided to split the model into two pathways:

  • An attention layer
  • Another bidirectional LSTM layer, with 40 units

The best way to look at this is that we have made two branches: the former (LSTM followed by attention) maintains simplicity, and the latter allows the model to learn more by having an additional layer. We selected 40 units mostly based on experimentation and intuition: we noticed that with a larger number of units the model started to overfit almost immediately.

In the latter branch, the output of the 2nd bidirectional LSTM layer is used for three operations, namely, an attention layer, global average pooling and global max pooling. Each of these layers brings forth diverse features from the data and contains 80 units.

All these outputs (from both the former and latter branches) are concatenated (the concatenation of 4 outputs gives 320 units) and fed into a layer of 32 units with a ReLU activation. This is followed by batch normalization and dropout to speed up training and help reduce overfitting. After this, we have the final output layer with a sigmoid activation function.
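The unit counts above follow from simple arithmetic: a bidirectional layer concatenates its forward and backward outputs, so 40 units per direction give 80 features. A quick check, with variable names chosen for this sketch:

```python
def bidirectional_output_size(units):
    """A bidirectional layer concatenates forward and backward outputs."""
    return 2 * units

lstm_units = 40
branch_size = bidirectional_output_size(lstm_units)

# Four outputs are concatenated: attention on the first LSTM, plus
# attention, global average pooling and global max pooling on the second.
n_outputs = 4
concat_size = n_outputs * branch_size
print(branch_size, concat_size)
```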

Performance

Kaggle had set the evaluation metric to be the F1 score. This was a more suitable choice than accuracy because of the class imbalance present in the dataset. Moreover, because some of the questions are labelled incorrectly, techniques used to handle class imbalance, such as undersampling and oversampling, might actually increase the number of incorrectly labelled questions or decrease the number of correctly labelled ones. Further, computational constraints were another important factor to keep in mind while making any decision.

Our approach for model validation included creating a train and a validation data set. We ran 10 epochs, the maximum we could run with this model: beyond that we would start overfitting or violate the computational constraints. We started SWA from the 4th epoch, as at that point the F1 score had already reached close to 0.65. Thus, it was a good point to start the process.
Once we had the final predictions from the model, we binarized the probabilities using a threshold obtained on the basis of the validation dataset.
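One simple way to obtain such a threshold is to try a few candidates on the validation set and keep the one with the best F1 score. The sketch below assumes that approach; the labels, probabilities and candidate values are made up, and treating a probability equal to the threshold as the positive class is also an assumption.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probabilities, candidates):
    """Return the candidate threshold with the highest validation F1."""
    def score(threshold):
        predictions = [1 if p >= threshold else 0 for p in probabilities]
        return f1_score(y_true, predictions)
    return max(candidates, key=score)

# Hypothetical validation labels and predicted probabilities.
labels = [0, 0, 1, 1]
probs = [0.1, 0.35, 0.45, 0.9]
chosen = best_threshold(labels, probs, candidates=[0.3, 0.4, 0.5])
print(chosen)
```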

Kaggle’s scoring process used only 15% of the data for the public leaderboard and the remainder for the private leaderboard. Our final model returned a score of 0.68 on the public leaderboard and around 0.68875 on the private leaderboard. This stability in the score indicates that the model generalizes well.

Here, is a look at the confusion matrix:

Conclusion

The traditional ML approach yielded a score of 0.583 as compared to the deep learning model’s score of 0.68.
While the deep learning model clearly outperformed the traditional ML stacking, there are a few points to consider before you set the course of your text classification problem:

  • The computational effort was larger while building the deep learning model. We used a standard Kaggle kernel (14GB RAM, 2GB GPU) and, as per competition rules, there was a 2 hour time limit on GPU usage.
  • The journey of taking all the decisions associated with the model architecture, though an interesting process, took significant time and effort. A lot of experimentation was involved before settling on the final model’s design. Further, there may be even more room for improvement in the architecture.
  • Data sufficiency: a prerequisite for any DL approach is the availability of a large dataset. In our use case the train set had 1.3M records.
  • On performance, DL is the clear winner here, with even our baseline model (F1 of 0.63) outperforming the stacked ML model (0.583).
  • The deep learning model did not require the extensive text pre-processing and feature engineering that the traditional ML approach involved.
  • The model was able to take advantage of the pre-trained word embeddings and learn a lot of the intricate patterns found in the text.

About the writers:

  • Ujjayant Sinha: Data science enthusiast with interest in natural language problems.
  • Ankit Gadi: Driven by a knack and passion for data science coupled with a strong foundation in Operations Research and Statistics has helped me embark on my data science journey.
]]>
NLP with ML https://www.datasciencediscovery.com/index.php/2019/02/18/nlp-with-ml/?utm_source=rss&utm_medium=rss&utm_campaign=nlp-with-ml https://www.datasciencediscovery.com/index.php/2019/02/18/nlp-with-ml/#respond Mon, 18 Feb 2019 22:39:32 +0000 http://datasciencediscovery.com/?p=400 Text Classification Purpose: Natural language processing (NLP) has been widely popular, with the large amount of data available (in emails, web pages, sms) it becomes important to extract valuable information from textual data. An assortment of machine learning techniques designed to accomplish this task. With current advances in deep learning, we felt it would be […]]]>

Text Classification

Purpose:
Natural language processing (NLP) has become widely popular. With the large amount of data available (in emails, web pages, SMS), it is important to extract valuable information from textual data, and an assortment of machine learning techniques has been designed to accomplish this task. With current advances in deep learning, we felt it would be interesting to compare traditional and deep learning techniques. We picked a Kaggle playground data set for text classification and implemented both types of algorithms for comparison.

Problem

In today’s world, websites have to deal with toxic and divisive content. This is especially true of major websites like Quora, which cater to large traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

A question is classified as insincere if it:

  • Has a non-neutral tone directed at someone
  • Is discriminatory or contains abusive language
  • Contains false information

For more information regarding the challenge you can use the following link.

Code

The full code is available here.

Methodology

In this article we will tackle text classification by using machine learning and NLP techniques. For any data science problem with textual data the common steps include:

  • Data exploration
  • Text pre-processing
  • Feature engineering
    • Text sentiment
    • Topic modelling
    • TFIDF and Count Vectorizer
    • Text Descriptive Features
  • Model selection and Evaluation

Let’s explore them step by step in more detail.

Data Exploration

Data exploration is one of the most important steps of any project: you need to familiarize yourself with the data prior to implementing any modeling technique.

import os
print(os.listdir("../input"))

Our dataset includes:

  • train.csv – the training set
  • test.csv – the test set
  • sample_submission.csv – A sample submission in the correct format
  • embeddings – Folder containing word embeddings.

We are not allowed to use any external data sources. Only the embeddings given to us in this folder can be used for building our models.

A look at the size of our train and test data:

  • Shape of train: (Rows 1,306,122 with 3 columns)
  • Shape of test: (Rows 56,370 with 2 columns)
What does the data look like?

In the target variable 1 represents the class Insincere and 0 the Sincere class of questions.

Let’s explore the distribution of the target variable:

import numpy as np
import pandas as pd
import seaborn as sns
color = sns.color_palette()

# Jupyter notebook magic: render plots inline
%matplotlib inline

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# Load the training set listed above
train_df = pd.read_csv("../input/train.csv")

## target distribution ##
cnt_srs = train_df['target'].value_counts()
labels = np.array(cnt_srs.index)
sizes = np.array((cnt_srs / cnt_srs.sum()) * 100)  # share of each class in percent

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Target distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="usertype")
1: Insincere & 0: Sincere

Box Plots:

The box plots shared below help us understand whether there are any patterns in the dataset regarding the word count or the number of characters.

Insincere questions have more words per question

Insincere questions have more characters than sincere questions

Sincere questions have less punctuation

Sincere questions have more upper case words

Word Clouds:

For questions classified as sincere we see general words like “will”, “one” and so on. The word “will” is also prevalent in insincere questions, so during the data processing steps we will have to treat such common words. The word cloud also brings out how words like “Trump” and “liberal” are very specific to insincere questions, possibly because the person is making a statement about these topics rather than genuinely seeking an answer.

Sincere

Insincere

Text pre-processing

Unstructured text data is usually dirty: it contains misspelled words, inconsistent casing and various other issues. We need to clean the text and bring it to a standardized form before extracting information from it; without this step, the noise results in a poor model.

Broadly, consider the following steps:

Tokenization:

Tokenization refers to the splitting of strings of text into smaller chunks or tokens. Paragraphs or large bodies of text are tokenized into sentences and then sentences are broken down into individual words.

Normalization:

This refers to a series of steps that transforms the corpus of text into a single standard and consistent form. The following steps are a part of this process:

  • Converting all letters to lowercase
  • Removing punctuation marks, numbers, stop words (a, is, will etc.)

Stemming involves chopping off the end of a word or inflectional endings (-ing, -ed etc.) to get its root form or stem, using crude heuristic rules.

burning -> burn

Stemming works well most of the time, but it can return words that do not look correct intuitively.

difficulties -> difficulti

Lemmatization has the same goal as stemming. However, it uses a vocabulary and the morphological analysis of words, to remove inflectional endings and return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization aims to reduce the word properly so that it makes sense according to the language.

ran -> run, difficulties -> difficulty

The idea for stemming or lemmatization of words is to reduce words into a common form. For example, difficulties and difficulty will portray the same intent and context.
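The normalization steps described above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word set, the suffix rules and the lemma dictionary below are tiny hypothetical stand-ins, not NLTK's stop-word list, the Porter stemmer or WordNet.

```python
import re

# Hypothetical, tiny stand-ins for a real stop-word list and lemma dictionary.
STOP_WORDS = {"a", "is", "will", "the", "to"}
LEMMAS = {"ran": "run", "difficulties": "difficulty"}
# Crude suffix rules in the spirit of a stemmer (checked in order).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", "")]


def normalize(text):
    """Lowercase, drop punctuation and digits, then remove stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # punctuation
    text = re.sub(r"[0-9]", "", text)    # numbers
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def toy_stem(word):
    """Chop a known suffix off the word without checking the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary and return its base form if known."""
    return LEMMAS.get(word, word)


tokens = normalize("The fire is burning; 3 difficulties ran into it!")
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]
print(tokens)
print(stems)
print(lemmas)
```

The stemmer turns "difficulties" into the non-word "difficulti", while the dictionary lookup returns "difficulty", which mirrors the difference between the two approaches.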

For our use case we have performed the following operations to clean the data (using the library NLTK):

  • Convert to lower-case.
  • Remove punctuation and numbers.
  • Removing Stop words: NLTK corpus contains 179 stop words such as “for”, “having”, “yours” and so on.
  • Lemmatize words
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK data: nltk.download('stopwords') and nltk.download('wordnet')

# Lower case
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(w.lower() for w in x.split()))
# Remove punctuation (regex=True is needed on recent pandas versions)
all_data['question_text'] = all_data['question_text'].str.replace(r'[^\w\s]', '', regex=True)
# Remove numbers
all_data['question_text'] = all_data['question_text'].str.replace(r'[0-9]', '', regex=True)
# Remove stop words and words with length <= 2
stop = set(stopwords.words('english'))
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop and len(w) > 2))
# Lemmatize, treating each word as a verb ('v')
wl = WordNetLemmatizer()
all_data['question_text'] = all_data['question_text'].apply(lambda x: " ".join(wl.lemmatize(w, 'v') for w in x.split()))

Feature Engineering

This part is what makes the difference between a good and a bad solution in any ML project. So what features can we create in our use case? We can start with understanding the sentiment.

Text sentiment:

Sentiment analysis is a part of opinion mining and involves building a system to extract the opinion from a text. That is, we wish to get a score that tells us how positive or negative the text is.
Our assumption for this data set was that questions flagged as insincere may contain toxic content and would exhibit a negative sentiment. However, as far as modeling features are concerned, sentiment turned out to be a weak feature. On deeper evaluation, we noticed that several questions with high polarity scores appeared under both the insincere and the sincere tags.

Topic modelling:

Topic modelling is an approach to identify topics present across a corpus of text. A topic is defined as a repeating pattern of co-occurring terms in a corpus. A document contains multiple topics in varying proportions. So, for example, a document based on healthcare is more likely to contain a higher ratio of words like “doctor” and “surgery” than words such as “brakes” and “gear”, which indicate a theme of automobiles.

Using a technique like Latent Dirichlet Allocation to get the distribution of topics across the corpus would potentially help to get a sense of the themes discussed in the set of questions. Further, we hypothesize that there would be some difference between the topics of sincere and insincere questions.

The image below shows the distribution of the topics with respect to the different classes by taking an average.

Sincere Topic Distribution
Insincere Topic Distribution

Top words from the top topics for the insincere class:

  • Topic 45: trump, part, president, donald, drink, similar, sport, websites, suffer, insurance, abroad, court, respect, would, wall.
  • Topic 59: quora, question, ask, answer, wear, control, actually, treat, people, hear, worst, western, racist, many, opportunities.
  • Topic 62: sex, hate, act, culture, pakistan, add, society, doctor, bring, present, people, search, pressure, characteristics, enjoy.
  • Topic 77: want, don’t, tell, guy, try, like, know, doesn’t, kill, people, say, let, brain, get, would.
  • Topic 79: women, men, white, black, water, watch, share, video, others, character, youtube, save, problem, prevent, people.

Top words from the top topics for the sincere class:

  • Topic 0: use, like, best, possible, make, come, cause, good, become, would, singer, get, know, happen, etc.
  • Topic 1: make, use, like, best, cause, good, happen, many, would, find, better, nutritional, jar, work, venus.
  • Topic 34: job, engineer, company, chinese, get, work, project, interview, graduate, best, include, india, example, good, accord.
  • Topic 56: someone, feel, love, man, process, like, post, view, would, care, else, give, advice, step, night.
  • Topic 77: want, dont, tell, guy, try, like, know, doesnt, kill, people, say, let, brain, get, would.

Count Vectorizer/tf-idf:

CountVectorizer returns a matrix which shows the frequency of each term in the vocabulary per document. On the other hand, tf-idf (term frequency-inverse document frequency) evaluates how important a word is to a document in the corpus.

tf(x) = (Number of times term x appears in a document) / (Total number of terms in the document)

idf(x) = log(Total number of documents / Number of documents with term x in it)

tf-idf(x) = tf(x) * idf(x)

The importance of a word in a document increases proportionally to the number of times it appears there, but it is offset by the number of documents in the corpus that contain it.
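The formulas can be checked on a toy corpus. The sketch below implements them literally in plain Python with the natural logarithm; the corpus is hypothetical. Library implementations, such as scikit-learn's TfidfVectorizer, apply a smoothed variant by default, so their numbers will differ.

```python
import math


def tf(term, doc):
    """Share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Hypothetical toy corpus of already tokenized questions.
docs = [
    ["best", "way", "learn", "python"],
    ["best", "way", "cook", "rice"],
    ["best", "python", "book"],
]

common = tf_idf("best", docs[0], docs)  # "best" appears in every document
rare = tf_idf("learn", docs[0], docs)   # "learn" appears in one document only
print(common, rare)
```

A word that occurs in every document gets a weight of zero, while a word confined to a single document keeps a positive weight.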

Both tf-idf and countvectorizer, as features, may indicate the relevance of a certain set of words to questions labelled as “Sincere” as well as “Insincere”.

The image below is obtained by using a TF-IDF vectorizer to create features and fitting a k-fold CV logistic regression model; it shows the words (of insincere questions) with the most weight.

Text Descriptive features

The idea behind building features such as the number of unique words, characters or exclamation points is to check for uniformity in the data set. We wish to observe if there are some similarities between the train and test set. Some questions that these meta features help answer include:

  • Does our test set consist of very short questions compared to the train set?
  • Is an insincere question framed haphazardly, with disregard for correct punctuation and possibly an abnormally high punctuation count?
  • Does a user writing a toxic or insincere question use uppercase letters very liberally?

The examples mentioned above give us the idea that there might be certain patterns specific to the respective classes that can be leveraged in our model. To give an ad hoc example of how useful meta features can be, on a musical note, the number of words per minute for Eminem is different based on the content/emotion of the song.

Some of the meta features are listed below:

  • Number of Words, Unique Words, Characters
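A minimal sketch of how such meta features could be computed for a single question follows. The feature names and the example sentence are illustrative assumptions, not the exact features used in our notebook.

```python
import string


def meta_features(question):
    """Descriptive counts for one question; feature names are illustrative."""
    words = question.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(w.lower() for w in words)),
        "num_chars": len(question),
        "num_punctuation": sum(ch in string.punctuation for ch in question),
        "num_upper_words": sum(w.isupper() for w in words),
    }


features = meta_features("WHY do people ask, ask and ASK again?!")
print(features)
```

Applied to every row (for example with a pandas `apply`), each key becomes one numeric column that can be compared between the train and test sets or between the two classes.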

We have added some box plots in the data exploration section to provide you with an idea regarding the prevalent distribution with respect to the different classes.

Model

So far we have cleaned up our text and carried out feature engineering. There are several ways to select the relevant features; however, for the purpose of this article we decided to build a separate model for each set of features, as this helps develop a general understanding and makes it easier to reuse these tactics on other text classification datasets.

A few things to note:

  • We are using the F1 score as our performance metric, as required by the competition rules. It also gives us a better picture than accuracy, keeping in mind the imbalance in the data.
  • For each model we are using 5-fold cross validation.
  • In order to find a suitable threshold (to convert the probabilities to a binary label) we have developed a loop. In this loop we try multiple potential thresholds and choose the one that maximizes the F1 score. The F1 score is calculated on the validation data set.
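To see why accuracy is misleading on imbalanced data, consider the plain-Python sketch below. The 94/6 class split is a hypothetical illustration rather than the exact ratio in the competition data.

```python
def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced labels: 94 sincere (0) and 6 insincere (1) questions.
y_true = [0] * 94 + [1] * 6
always_sincere = [0] * 100  # a "model" that never flags anything

acc = accuracy(y_true, always_sincere)
f1 = f1_score_binary(y_true, always_sincere)
print(acc, f1)
```

A model that never flags a question still scores 94% accuracy here, yet its F1 score is zero because it finds none of the insincere questions.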

There are two pieces of code that will be reused in most of the models:

import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from tqdm import tqdm

# train, test and selected_features are assumed to be defined already
start_time = time.time()
kf = KFold(n_splits=5, shuffle=True, random_state=43)
## Initialize 0's
test_pred_ots = 0
oof_pred_ots = np.zeros([train.shape[0],])

train_target = train['target'].values

x_test = test[selected_features].values

## Loop to split the data set
for i, (train_index, val_index) in tqdm(enumerate(kf.split(train))):
    # kf.split yields positional indices, hence iloc
    x_train = train.iloc[train_index][selected_features].values
    x_val = train.iloc[val_index][selected_features].values
    y_train, y_val = train_target[train_index], train_target[val_index]

    # Model
    classifier = LogisticRegression(C=0.1)
    classifier.fit(x_train, y_train)

    ## Validation set predicted
    val_preds = classifier.predict_proba(x_val)[:,1]

    ## Test set predictions, averaged over the 5 folds
    preds = classifier.predict_proba(x_test)[:,1]
    test_pred_ots += 0.2*preds
    oof_pred_ots[val_index] = val_preds
print("--- %s seconds for Model Selected Features ---" % (time.time() - start_time))

The code above runs 5-fold cross validation; with each split we train a model and make predictions on the validation and test datasets. At the end of all splits we get oof_pred_ots, the out-of-fold predictions on the validation sets combined into a single array covering every training row. We also get the test prediction probabilities averaged over the splits in test_pred_ots.

import numpy as np
from sklearn.metrics import f1_score

thresh_opt_ots = 0.5
f1_opt = 0
# Try thresholds from 0.10 to 0.90 and keep the one with the best F1
for thresh in np.arange(0.1, 0.91, 0.01):
    thresh = np.round(thresh, 2)
    f1 = f1_score(train_target, (oof_pred_ots.astype(float) > thresh).astype(int))
    #print("F1 score at threshold {0} is {1}".format(thresh, f1))
    if f1_opt < f1:
        f1_opt = f1
        thresh_opt_ots = thresh
print(thresh_opt_ots)
# np.int was removed from recent NumPy releases; the built-in int works
pred_train_ots = (oof_pred_ots > thresh_opt_ots).astype(int)
f1_score(train_target, pred_train_ots)

The code above will help find the best threshold.

First Model:
We used the text descriptive features and ran a logistic regression model with 5-fold cross validation; however, the F1 score was low (0.27).

Second Model:
We used the sentiment and topic modeling features and ran the same model as before. This time we got a better score (0.34).

Third Model:
We used TF-IDF features and tried logistic regression (F1 = 0.587) and LightGBM (F1 = 0.591). This is much better.

Fourth Model:
We used CountVectorizer features and tried logistic regression (F1 = 0.592) as well as multinomial (F1 = 0.55) and Bernoulli (F1 = 0.53) naive Bayes models.

Ensemble:

The idea here is that one model might be observing patterns that another isn’t. Further, an ensemble helps get better results while reducing the chance of over-fitting. We used stacking, which requires predictions for the entire train set. These are obtained by splitting the data at each fold into a train and a holdout set, fitting on the former and predicting on the latter, so that every row in the train data set receives a prediction from a model that did not see it during training.
We use these new predictions from the respective models as input variables and run another (logistic regression) model on top of them, giving us the final probabilities.
Our final F1 score was 0.604, with 0.589 on the leaderboard.

What’s Next

In the next article we will implement a deep learning approach to the same use case and draw comparisons between the two methodologies.

References:

About Us

Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.

About the writers:

  • Ujjayant Sinha: Data science enthusiast with interest in natural language problems.
  • Ankit Gadi: A knack and passion for data science, coupled with a strong foundation in Operations Research and Statistics, has helped me embark on my data science journey.
]]>
https://www.datasciencediscovery.com/index.php/2019/02/18/nlp-with-ml/feed/ 0 400
Deep Dive https://www.datasciencediscovery.com/index.php/2019/01/11/deep-dive/?utm_source=rss&utm_medium=rss&utm_campaign=deep-dive https://www.datasciencediscovery.com/index.php/2019/01/11/deep-dive/#respond Fri, 11 Jan 2019 14:25:00 +0000 http://www.datasciencediscovery.com/?p=770 Our team carries out an in-depth breakdown (deep dive) of complex topics such as NLP, Deep Learning, UMAP and several others. We also implement challenging projects, providing detailed explanations of the decisions and process involved. This section will be updated as we add more content to our blog. Please find the links to the respective […]]]>

Our team carries out an in-depth breakdown (deep dive) of complex topics such as NLP, Deep Learning, UMAP and several others.

We also implement challenging projects, providing detailed explanations of the decisions and process involved. This section will be updated as we add more content to our blog. Please find the links to the respective topics below:

Deep Learning – Introduction

What is this mythical beast I keep hearing about? Today, Deep Learning is a buzzword for a well deserved reason. Let’s do a deep dive into this subject and slay this beast. [Read More]

GANs – Introduction

The buzz around deep-fakes has reached far and wide, and it has been a topic of conversation for several months. Let’s understand what the buzz is all about and learn more about generative adversarial networks. [Read More]

Natural Language Processing

This series on Natural Language Processing is designed with the idea to start from scratch and slowly make our way to the state of the art models we hear about today. Layer by layer we will develop the necessary concepts and implement the same to strengthen our foundation. [Coming Soon]

Practice NLP

In today’s world, websites have to deal with toxic and divisive content. Let’s implement a text classification exercise. We try multiple traditional algorithms and also implement some of the latest deep learning models. [Read More]

Dimension Reduction

What to do when your data has too many variables? Can I visualize the data? Discover dimension reduction techniques including PCA, UMAP and others. [Read More]

Stock Prediction

There is a lot of research around this topic, spanning a plethora of techniques from econometric and time-series models to deep learning models. It becomes really difficult to understand what to implement and whether it will work. We will implement a time series and a deep learning technique and guide you through our steps and decisions. [Coming Soon]

]]>
https://www.datasciencediscovery.com/index.php/2019/01/11/deep-dive/feed/ 0 770
Mechanics of Deep Learning https://www.datasciencediscovery.com/index.php/2018/12/04/mechanics-of-deep-learning/?utm_source=rss&utm_medium=rss&utm_campaign=mechanics-of-deep-learning https://www.datasciencediscovery.com/index.php/2018/12/04/mechanics-of-deep-learning/#respond Tue, 04 Dec 2018 13:11:03 +0000 http://datasciencediscovery.com/?p=511 Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It […]]]>

Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work.

Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series is inspired by, and references, several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series are explained here.

Neural Networks &

Mechanics of Deep Learning

We have already covered some of the basics of the architecture and the respective components in the previous posts. But we need to understand one of the most important concepts.

How do Neural networks exactly work?

How are the weights updated in Neural networks?

Well, let’s get into the algorithms behind Neural Networks.

Gradient Descent

For most machine learning algorithms, optimization is used to minimize the cost/error function. Gradient Descent is one of the most popular optimization algorithms used in Machine Learning. There are many powerful ML algorithms that use gradient descent such as linear regression, logistic regression, support vector machine (SVM) and neural networks.

Intuition

Let’s take the classic mountain valley example with a twist. You meet a pirate, and in your travels you discover a map to the golden chalice of wisdom. The secret location is the lowest point in a very dark and deep valley. Given that there are no possible sources of natural or artificial light in this magical valley, both the pirate and you are in a race to reach the bottom of the valley in pitch darkness. The pirate decides to take steps forward randomly with the hope of eventually reaching the lowest point.

Both of you have the same starting point, but you think there must be a smarter way. At every step you decide to feel the gradient (slope) around you and take the steepest downhill step possible. By taking the best possible step every time, you win!

That is analogous to the gradient descent technique. We are operating blind, trying to take a step in the most optimal direction.

Let us say that we fit a regression model on our dataset. We need a cost function that measures the error between our prediction and the actual value, which we then minimize. The plot of our cost function will look like:

Gradient Descent

Source.

Gradient is another word for slope and the first step in gradient descent is to pick a starting value at random or set it to 0. Now, a gradient has the following characteristics:

  • Direction
  • Magnitude

Let’s take a mathematical function to further understand the same.

In mathematical terms, if our function is:

$
f(x) = e^{2}\sin(x)
$

The derivative:

$
\frac {\partial f}{\partial x} = e^2\cos(x)
$

If x = 0

$
\frac{\partial f}{\partial x} (0) = e^2 \approx 7.4
$

So when you start at 0 and move a little (take a step), the function changes by about 7.4 times (magnitude) the amount that you changed. Similarly, if you have multiple variables we take partial derivatives:

$
z = f(x,y) = xy + x^2
$

For a function such as the one above, we first treat y as a constant and differentiate with respect to x (here: y + 2x). Then we treat x as a constant and differentiate with respect to y (here: x). For example, at x = 3 and y = -3 the partial derivatives are 3 and 3, so the gradient is (3, 3). For composite functions, such as the stacked layers of a neural network, the derivative is obtained with the chain rule of calculus.
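A quick numerical check of these derivatives, using a central finite difference in plain Python. The helper name `central_difference` and the step size `h` are choices made for this sketch.

```python
import math


def f(x):
    return math.e ** 2 * math.sin(x)


def g(x, y):
    return x * y + x ** 2


def central_difference(func, x, h=1e-6):
    """Approximate the derivative of func at x with a symmetric difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


df_at_0 = central_difference(f, 0.0)                   # analytic value: e^2
dg_dx = central_difference(lambda x: g(x, -3.0), 3.0)  # y held constant: y + 2x
dg_dy = central_difference(lambda y: g(3.0, y), -3.0)  # x held constant: x
print(df_at_0, dg_dx, dg_dy)
```

The numerical slopes agree with the analytic expressions to within the finite-difference error, which is a handy sanity check whenever a gradient is derived by hand.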

Chain Rule

The gradient $\nabla f$ points in the direction of the greatest increase of the function, so gradient descent steps in the opposite direction.

In a feed-forward network, we are learning how the error varies as the weight is adjusted. The relationship between the net’s error and a single weight will look something like the image below (we will get into more detail a little later):

Backprop Chain Rule

As a neural network learns, it slowly adjusts several weights by calculating (dE/dw) the derivative of network Error with respect to the weights.

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example:

Metric              Value
Gradient Magnitude  2.5
Learning Rate       0.01

Then the gradient descent algorithm will pick the next point 0.025 away from the previous point. A small learning rate will take too long to converge, while with a very large learning rate the algorithm might overshoot and diverge away from the minimum point (miss the minimum completely).
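The effect of the learning rate can be seen in a small plain-Python experiment. The cost function J(w) = (w - 3)^2 and the three learning rates are hypothetical choices made for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient; returns every visited point."""
    points = [start]
    for _ in range(steps):
        current = points[-1]
        points.append(current - learning_rate * grad(current))
    return points


# One step with the numbers from the table: gradient 2.5, learning rate 0.01.
first_step = gradient_descent(lambda w: 2.5, 0.0, 0.01, 1)


# Hypothetical cost J(w) = (w - 3)^2 with gradient 2 * (w - 3); minimum at w = 3.
def grad(w):
    return 2 * (w - 3)


small = gradient_descent(grad, 0.0, 0.001, 50)  # barely moves in 50 steps
good = gradient_descent(grad, 0.0, 0.1, 50)     # settles close to 3
large = gradient_descent(grad, 0.0, 1.1, 50)    # overshoots further each step
print(first_step[1], small[-1], good[-1], large[-1])
```

With the tiny rate the iterate is still far from the minimum after 50 steps, with the moderate rate it lands close to 3, and with the large rate every step overshoots by more than the previous one.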

Finally, the weights are updated incrementally after each epoch (pass over the training dataset) till we get the best results.

Stochastic Gradient Descent

In gradient descent, a batch is the set of examples used to calculate the gradient in a single iteration. So far, we have assumed that the batch is the entire data set. But for large datasets, this gradient computation can be expensive.
Stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the full gradient ∇f(x), stochastic gradient descent samples an index i uniformly at random and computes ∇fi(x), the gradient for that single example, instead.
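A minimal sketch of stochastic gradient descent in plain Python, fitting a single weight to hypothetical noise-free data generated from y = 2x. The data, the learning rate and the number of iterations are assumptions made for this example.

```python
import random

random.seed(0)  # reproducible sampling

# Hypothetical noise-free data generated from y = 2 * x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2 * x for x in xs]


def sample_gradient(w, i):
    """Gradient of the single-example loss 0.5 * (w * x_i - y_i)^2 w.r.t. w."""
    return (w * xs[i] - ys[i]) * xs[i]


w = 0.0
learning_rate = 0.1
for _ in range(200):
    i = random.randrange(len(xs))  # pick one example uniformly at random
    w -= learning_rate * sample_gradient(w, i)
print(w)
```

Each update looks at a single example instead of the whole dataset, so an iteration is cheap; on this toy problem the weight still approaches the true slope of 2.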

Back-propagation

Back-propagation is simply a technique or method of updating the weights. We are aware of partial derivatives, the chain rule and, most importantly, gradient descent. But neural networks have multiple layers and different activation functions, which makes it difficult to visualize how everything comes together. Consider a simple example with the following architecture:

Backprop Explain

Forward Pass

Step 1: Initialization

Let us initialize the weights and the bias.

Table 1 a: Weight Initialization Example

Weights  Value
w1       0.10
w2       0.15
w3       0.03
w4       0.08
w5       0.18
w6       0.06
w7       0.11
w8       0.26

Table 1 b: Bias Initialization Example

Bias  Value
b1    0.05
b2    0.42

Assume the initial input values to be [0.95, 0.06] and the target values [0.05, 0.82].

Step 2: Calculations

To get the value of H1:

H1 = w1 * x1 + w2 * x2 + b1
   = 0.1 * 0.95 + 0.15 * 0.06 + 0.05
   = 0.154

As we have a sigmoid activation function:

$
\sigma(X) = \frac{1}{1+e^{-X}}
$

$
H1 = \frac{1}{1+e^{-H1}} = \frac{1}{1+e^{-0.154}} = 0.538
$

Similarly, we can calculate H2.

H1 = 0.538 and H2 = 0.52

Now we calculate the value for output nodes Y1 and Y2.

Y1 = w5 * H1 + w6 * H2 + b2
   = 0.18 * 0.538 + 0.06 * 0.52 + 0.42
   = 0.548

$
Y1 = \frac{1}{1+e^{-Y1}} = \frac{1}{1+e^{-0.548}} = 0.633
$

Upon calculation:

Y1 = 0.633 & Y2 = 0.648

Step 3: Error Function

Let the error function be:

$
J( \theta ) = \sum \frac{1}{2} {( {target - output})^2}
$

E1 = 0.5 * (0.05 - 0.63368)^2 = 0.17034
E2 = 0.5 * (0.82 - 0.64893)^2 = 0.01463
Total Error (E) = E1 + E2 = 0.18497

Backward Pass

Back-propagate the Errors to update the weights.

Error at W5:

$
\frac{\partial E}{\partial W5} = \frac{\partial E}{\partial out_{Y1}} * \frac{\partial out_{Y1}}{\partial Y1} * \frac{\partial Y1}{\partial W5}
$

Component 1: The Cost/Error Function

target: T
output: out
E = 0.5 * (T1 - out Y1)^2 + 0.5 * (T2 - out Y2)^2
Differentiating:
- (T1 - out Y1) = - (0.05 - 0.63368) = 0.58368

Component 2: The Activation function

output: out
out Y1 = 1/(1 + exp(-Y1))
Differentiating:
out Y1 * (1 - out Y1) = 0.63368 * (1 - 0.63368) = 0.23213

Component 3: The Function of Weights

Y1 = w5 * H1 + w6 * H2 + b2
Differentiating:
H1 * 1 = 0.538

Finally, we have the change in W5:

$
\frac{\partial E}{\partial W5} = 0.58368 * 0.23213 * 0.538 = 0.07289
$

In order to update W5, recall the discussion on gradient descent: we step against the gradient. Let alpha be the learning rate with a chosen value of 0.01.

Updated W5 will be:

$
W5 - \alpha * \frac{\partial E}{\partial W5}
$

= 0.18 - 0.01 * 0.07289

= 0.1792711
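The forward pass and the W5 computation can be reproduced in plain Python. The wiring below (H2 uses w3, w4 and b1; Y2 uses w7, w8 and b2) is an assumption inferred from the reported values of H2 and Y2, since the architecture image is not reproduced in the text, and the update uses the usual descent rule w - alpha * dE/dw. Because the code carries unrounded intermediate values, the last decimals differ slightly from the hand calculation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Values from the tables above; the wiring of w3, w4, w7, w8 is assumed.
x1, x2 = 0.95, 0.06
t1, t2 = 0.05, 0.82
w1, w2, w3, w4 = 0.10, 0.15, 0.03, 0.08
w5, w6, w7, w8 = 0.18, 0.06, 0.11, 0.26
b1, b2 = 0.05, 0.42

# Forward pass
out_h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
out_h2 = sigmoid(w3 * x1 + w4 * x2 + b1)
out_y1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_y2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
error = 0.5 * (t1 - out_y1) ** 2 + 0.5 * (t2 - out_y2) ** 2

# Backward pass for w5: the product of the three components
d_error_d_out = -(t1 - out_y1)        # component 1: error function
d_out_d_net = out_y1 * (1 - out_y1)   # component 2: sigmoid activation
d_net_d_w5 = out_h1                   # component 3: function of weights
grad_w5 = d_error_d_out * d_out_d_net * d_net_d_w5

alpha = 0.01
w5_new = w5 - alpha * grad_w5         # step against the gradient
print(out_y1, out_y2, error, grad_w5, w5_new)
```

Since the output Y1 is well above its target, the gradient with respect to w5 is positive and the update lowers w5.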

Similarly, we can update the remaining weights. Let’s have a look at the formula to update W1:

$
\frac{\partial E}{\partial w_{1}} = (\sum\limits_{i}{\frac{\partial E}{\partial out_{i}} * \frac{\partial out_{i}}{\partial Y_{i}} * \frac{\partial Y_{i}}{\partial out_{h1}}}) * \frac{\partial out_{h1}}{\partial H1} * \frac{\partial H1}{\partial w_{1}}
$

Backprop Explain

It looks complicated, but really we are going back layer by layer to get the respective value. w1 feeds into neuron H1, and H1 is connected to both Y1 and Y2. Moving backwards, we differentiate the error function, then the activation function and the function of weights of Y1 and Y2. That leads us to H1, where we differentiate its activation function and its function of weights.

This is how we back-propagate the errors and update all the weights. Once we update all the weights, that is one epoch or pass over the dataset. We then start the entire process of forward pass and backward pass again, and repeat it multiple times with the purpose of minimizing the error.
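As a sanity check, the layer-by-layer formula for w1 can be compared against a finite-difference estimate of the same derivative. This sketch assumes the same wiring as the worked example (w5 and w7 leave H1; H2 uses w3, w4 and b1), which is inferred from the reported values rather than stated in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


X = (0.95, 0.06)
T = (0.05, 0.82)
B1, B2 = 0.05, 0.42


def forward(w):
    """w = [w1..w8]; returns hidden outputs, network outputs and total error."""
    h = [sigmoid(w[0] * X[0] + w[1] * X[1] + B1),
         sigmoid(w[2] * X[0] + w[3] * X[1] + B1)]
    y = [sigmoid(w[4] * h[0] + w[5] * h[1] + B2),
         sigmoid(w[6] * h[0] + w[7] * h[1] + B2)]
    error = sum(0.5 * (t - o) ** 2 for t, o in zip(T, y))
    return h, y, error


weights = [0.10, 0.15, 0.03, 0.08, 0.18, 0.06, 0.11, 0.26]
h, y, _ = forward(weights)

# Sum over both outputs: dE/dout_i * dout_i/dY_i * dY_i/dout_h1
w_from_h1 = (weights[4], weights[6])  # w5 and w7 leave H1 (assumed wiring)
upstream = sum(-(t - o) * o * (1 - o) * w
               for t, o, w in zip(T, y, w_from_h1))
# Multiply by dout_h1/dH1 and dH1/dw1
grad_w1 = upstream * h[0] * (1 - h[0]) * X[0]

# Finite-difference estimate of the same derivative
eps = 1e-6
plus = list(weights)
plus[0] += eps
minus = list(weights)
minus[0] -= eps
numeric = (forward(plus)[2] - forward(minus)[2]) / (2 * eps)
print(grad_w1, numeric)
```

When the two numbers agree, the chain-rule expression has been assembled correctly; this kind of gradient check is a common way to debug a hand-written backward pass.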

When do we stop?

We stop before the model over-fits; that is, we want the minimum validation error, and we stop once the validation error starts to rise even though the training error keeps falling.

Hopefully, this explains the entire process of how neural networks actually work and sheds some light on gradient descent and back-propagation.

What’s Next?

Activation: We have talked about activation functions in the past posts, but let’s understand in more detail the different types of activation functions and explore their characteristics.

]]>
https://www.datasciencediscovery.com/index.php/2018/12/04/mechanics-of-deep-learning/feed/ 0 511
Deep Learning Architecture https://www.datasciencediscovery.com/index.php/2018/12/01/deep-learning-architecture/?utm_source=rss&utm_medium=rss&utm_campaign=deep-learning-architecture https://www.datasciencediscovery.com/index.php/2018/12/01/deep-learning-architecture/#respond Sat, 01 Dec 2018 09:52:47 +0000 http://datasciencediscovery.com/?p=533 What are Neural Networks made of? Understanding the different components and the architecture of Neural Networks. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is […]]]>

What are Neural Networks made of? Understanding the different components and the architecture of Neural Networks.

Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series is inspired by, and references, several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series are explained here.

Architecture

The introduction to neural networks and a general idea behind the inspiration for such an algorithm has been discussed in the previous post. We will talk about the building blocks of neural networks in detail in future posts, but in this post we focus on the overall structure of Neural networks and discuss some of the components.

Terms

  • Size: The number of nodes in the model.
  • Width: The number of nodes in a specific layer.
  • Depth: The number of layers in a neural network.
  • Capacity: The type or structure of functions that can be learned by a network configuration.
  • Architecture: The specific arrangement of the layers and nodes in the network.
  • Input Layer: Input variables, sometimes called the visible layer.
  • Hidden Layers: Layers of nodes between the input and output layers.
  • Output Layer: A layer of nodes that produce the output variables.

Now, let’s briefly discuss the elements of a neural network.

The Elements of a Neuron

Neural networks are a set of algorithms, inspired by the working of the human brain. These algorithms are designed to recognize patterns. Neural networks consist of layers which are made of nodes. These nodes are where all the calculations happen.

NN Layer

Source

  • Neurons: The term artificial neuron is used in the literature interchangeably with node, unit, perceptron, processing element or even computational unit. A neuron is the component of a neural network where a mathematical transformation takes place. It uses either the input data or the outputs of other neurons. It consists of the following components:

Weights:

Each input has its own relative weight. Weights are adaptive coefficients that determine the intensity of the input signal as registered by the artificial neuron. Using techniques like back-propagation discussed here, the weights are updated with each iteration in order to reduce the error. For now, all we need to know is that the weights will be updated using special algorithms and that these algorithms require differentiation. So the weights will be updated over time, but when we start training a neural network the question is: how do we initialize the weights?

Zero Initialization
  • If we initialize all the weights to 0, every neuron in a layer computes the same output, which reduces the model to something no more expressive than a linear one. The gradients that depend on the weights are 0 as well.
  • The bigger issue is that when all weights have the same value, differentiation gives the same value for each of them. This means all the weights are still equal to each other in the subsequent iteration. If all of the weights are the same, they all receive the same error signal and the model will not learn anything: there is no source of asymmetry between the neurons.
Random Initialization
  • Random initialization sounds like a good option, but it comes with its own problem. In this method, the weights are initialized randomly but very close to zero. It is a better option than zero initialization, as the initial weights are close to the ‘best guess’ expected value and the symmetry is broken enough for the algorithm to work. Then what is the problem? Smaller numbers do not necessarily work better, especially in a deep neural network.
  • The gradients calculated in back-propagation are proportional to the weights. We know that the gradient tends to become smaller as we back-propagate through the layers. If we also initialize the weights to very small values, the gradients become very small, and the network can take much longer to train or even fail to converge.
  • To give you an idea, with back-propagation we start with the output layer and go back layer by layer. At each layer we calculate gradients (for simplicity, let’s say we differentiate some function). If the gradients are small, that is, the rate of change is small, the weights are updated very slowly. That is the problem.
He-et-al Initialization

In this method, the weights are initialized while keeping in mind the size of the previous layer, that is, the number of neurons in the previous layer. This helps the cost function reach a minimum faster. The weights are still random, but their range depends on the layer size, so the initialization is more controlled. More details about this technique are available here.

There are several techniques which can be used for initialization, but the techniques mentioned here will give you some idea of how the weights, as a component, fit into neural networks.
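
To compare the three initialization schemes, here is a minimal NumPy sketch. The layer sizes, the 0.01 scale for the small random weights and the ReLU activation are assumptions chosen for illustration. The He-et-al variant draws weights with standard deviation sqrt(2 / n_in), where n_in is the number of neurons in the previous layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128                    # hypothetical sizes of the previous and current layer

weights = {
    "zero": np.zeros((n_in, n_out)),
    "small_random": rng.normal(0.0, 0.01, size=(n_in, n_out)),
    "he": rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
}

x = rng.normal(size=(1000, n_in))         # stand-in for the outputs of the previous layer

def relu(z):
    return np.maximum(z, 0.0)

# Spread of the layer output under each scheme: zero weights give no signal at all,
# very small weights shrink the signal, He-et-al keeps it at a usable scale.
stds = {name: float(relu(x @ w).std()) for name, w in weights.items()}
for name, value in stds.items():
    print(f"{name:>12}: output std = {value:.4f}")
```

With zero weights every neuron outputs exactly the same value, which is the symmetry problem described above.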

Functions & Components

Summation Function
  • Summation Function: The inputs and corresponding weights are vectors which can be represented as (x1, x2 … xn) and (w1, w2 … wn). The total input signal is the dot product of these two vectors. The result is a single number (x1 * w1) + (x2 * w2) +…+(xn * wn). The input and weighting coefficients can be combined in many different ways before passing on to the transfer function. Now, keep in mind that the summation function is only nomenclature. The function can be any kind of aggregation such as the minimum, maximum, majority, product or any other normalizing algorithm. The network architecture and paradigm determine the specific algorithm for combining neural inputs.

Activation Function

  • Transfer Function: This is also known as an activation function. Basically in the activation function the summation can be compared with some threshold (based on a function) to determine the neural output. If the sum is greater than the threshold value, the processing element generates a signal and if it is less than the threshold, no signal is generated.
Scaling and Limiting
  • Scaling and Limiting: Depending on the structure, there can be an additional step that the result is passed through. The scaling simply computes (scale factor * transfer value + an offset). Limiting is the mechanism which ensures that the scaled result does not exceed an upper or lower bound.
Output Signal
  • Output Signal: Final touch here is the output signal. A neuron will have a single output that can be transferred to the neurons in the next layer. Neurons are allowed to compete with each other in some architectures. This competition determines which artificial neuron will be active or provides an output and is involved in the learning process.
Error Function
  • Error Function and Back-Propagated Value: The difference between the current output and the desired output is calculated as an error. A mathematical function, the loss/error function, is applied to it, which can be the square of the error or some other formula. Each error function has its use case. Softmax Cross Entropy is one such example. The Softmax function is usually used for multi-class classification, that is, when there are many classes in our use case. We touch on the working of this function in more detail in a later post. The error is propagated backwards to a previous layer. This back-propagated value can be either the error, or the error scaled in some manner (often by the derivative of the transfer function).
Learning Rate
  • Learning Rate: Learning rate determines how fast weights are updated. In general, you want to find a learning rate that is low enough that the network converges to something useful, but high enough that you don’t have to spend a lot of time training it. Learning rate might be different for different layers of a neural network.
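
The first steps of the list above (summation, transfer, scaling and limiting) can be sketched as a single function. This is a minimal illustration with a step function as the transfer function; the function name, default values and example numbers are all hypothetical.

```python
import numpy as np

def neuron_output(x, w, threshold=0.0, scale=1.0, offset=0.0, lower=-1.0, upper=1.0):
    total = float(np.dot(x, w))                      # summation function: weighted sum of inputs
    transfer = 1.0 if total > threshold else 0.0     # transfer function: compare with a threshold
    scaled = scale * transfer + offset               # scaling: scale factor * transfer value + offset
    return float(np.clip(scaled, lower, upper))      # limiting: keep the result within the bounds

x = np.array([0.5, -1.0, 2.0])    # inputs
w = np.array([0.4, 0.3, 0.1])     # weights, the weighted sum is about 0.1

fires = neuron_output(x, w)                          # sum is above the threshold 0.0
silent = neuron_output(x, w, threshold=0.5)          # sum is below the threshold 0.5
limited = neuron_output(x, w, scale=3.0, upper=2.0)  # scaled value 3.0 is clipped to the bound 2.0
print(fires, silent, limited)
```

In practice the step function is usually replaced by a differentiable activation, since the weight updates rely on differentiation.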

Layers

A neural network is the grouping of neurons into layers, and there can be many layers between the input and output layer. Most applications require networks that contain at least three layers: input, hidden, and output. Each neuron in the hidden layer will be connected to all the neurons in the previous layer. We can start with two basic types of perceptrons. They feed information from the front to the back and are therefore called Feed Forward networks.

Single Layer Feed Forward Neural Network consists of a single layer, that is, it only has the input and output layer. A single-layer perceptron can only be used for very simple problems, such as classifying classes that can be separated by a line.

Multi Layer Feed Forward Neural Network consists of one or more hidden layers, whose computation nodes are called hidden neurons or hidden units. A Multilayer Perceptron can represent convex regions, so it can separate classes that a single line cannot, including in high-dimensional spaces.
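
A classic way to see the difference is the XOR problem, whose two classes cannot be separated by a line. The sketch below is an illustration with hand-picked weights (nothing is trained): it tries a small grid of single-layer perceptrons, none of which reproduces XOR, and then builds a network with one hidden layer that does.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_target = np.array([0, 1, 1, 0])

def step(z):
    return (z > 0).astype(int)

# Single layer: output = step(w1*x1 + w2*x2 + b). Try a coarse grid of weights and biases.
grid = np.arange(-2.0, 2.01, 0.5)
single_layer_solves_xor = any(
    np.array_equal(step(X @ np.array([w1, w2]) + b), xor_target)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# One hidden layer with two units: the first acts like OR, the second like AND.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])      # output fires for "OR but not AND", which is XOR
b_out = -0.5

hidden = step(X @ W_hidden + b_hidden)
mlp_output = step(hidden @ w_out + b_out)

print(single_layer_solves_xor, mlp_output)
```

The grid search is only a demonstration, not a proof, but it matches the known result that no single line separates the XOR classes.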

NN Layer

Source

What architecture works for me?

Now, we know it makes sense to have multiple layers especially when dealing with images or complex data.

How do we decide on what architecture to use? How many hidden layers should be used?

  • Larger Neural Networks can represent more complicated functions. This can improve accuracy, but it also increases the chance of over-fitting. However, there are other methods that help control over-fitting (such as L2 regularization and Dropout), so the number of neurons should not be used as the control parameter for it.
  • More neurons and layers might add to the computational load.

There has to be a trade-off and there is no definite answer to this question. However, I can suggest the following:

Experimentation: Find out what works best for your data given the computational constraints.

Intuition or Google: Based on experience with past models, you can come up with an answer. If you have a standard DL problem such as image classification, you can Google to find out what others have used (ResNet, VGG and so on).

Search: Try random or grid search for different architectures and choose the one giving the best score.
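
As a rough sketch of the search idea, the code below samples random architectures within a parameter budget and keeps the best one. Everything here is hypothetical: `toy_score` is only a placeholder for "train this architecture and return its validation score", and the candidate widths, input size and budget are made-up values.

```python
import random

def count_parameters(n_inputs, hidden_widths, n_outputs):
    """Weights plus biases of a fully connected network."""
    widths = [n_inputs, *hidden_widths, n_outputs]
    return sum(a * b + b for a, b in zip(widths, widths[1:]))

def toy_score(hidden_widths):
    """Placeholder score: a real search would train the model and validate it.
    This stand-in simply prefers networks with about 5000 parameters."""
    n_params = count_parameters(20, hidden_widths, 2)
    return -abs(n_params - 5000) / 5000

def random_search(evaluate, n_trials, max_params, seed=0):
    rng = random.Random(seed)
    best_score, best_hidden = None, None
    for _ in range(n_trials):
        depth = rng.randint(1, 4)                                   # number of hidden layers
        hidden = [rng.choice([16, 32, 64, 128]) for _ in range(depth)]
        if count_parameters(20, hidden, 2) > max_params:            # computational constraint
            continue
        score = evaluate(hidden)
        if best_score is None or score > best_score:
            best_score, best_hidden = score, hidden
    return best_score, best_hidden

best_score, best_hidden = random_search(toy_score, n_trials=25, max_params=20000)
print(best_hidden, best_score)
```

A grid search would replace the random sampling with a loop over every combination of depth and width.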

There are several different architectures shown in the image below. To summarize, these are the parameters that govern or define the architecture:

  • The Elements of a Neuron (Type of activation function, scaling, limiting……)
  • The Size, Width and Depth of the Neural Network.
  • How different neurons are connected to each other. For example the Hopfield network (HN) is a network where every neuron is connected to every other neuron.
NN Architecture
Source Fjodor Van Veen

What’s Next?

Inside the Black box: What is going on inside this Black box algorithm? Trying to build intuition and understanding of what is going on in the different layers of a neural network. Let’s continue with the learning in this next article where we take a closer look at what happens with the different neurons and respective layers of a Neural Network.

]]>
https://www.datasciencediscovery.com/index.php/2018/12/01/deep-learning-architecture/feed/ 0 533
Deep Learning Invasive Species https://www.datasciencediscovery.com/index.php/2018/11/25/deep-learning-invasive-species/?utm_source=rss&utm_medium=rss&utm_campaign=deep-learning-invasive-species https://www.datasciencediscovery.com/index.php/2018/11/25/deep-learning-invasive-species/#respond Sun, 25 Nov 2018 14:13:25 +0000 http://datasciencediscovery.com/?p=554 Don’t get alarmed, we are going to put what we have learnt into practice on a playground kaggle data set explaining the code along the way. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series […]]]>

Don’t get alarmed: we are going to put what we have learnt into practice on a playground Kaggle data set, explaining the code along the way.

Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.

Update (Dec 2018)

This was coded some time back and uses the library Fastai version 0.7. Since then there have been updates to the library and new releases of PyTorch as well. The current code will no longer work with Fastai v1, but there are still some important concepts that can be learned from it, such as:

  • Practical Application of Deep Learning
  • Better modeling Practices like data augmentation, image standardization.
  • Hyperparameter tuning
  • Transfer Learning

Invasive Species

We have covered some basic concepts regarding what neural networks are and how they work. However, I feel it has been too much theory, and while learning any new concept it is also important to see that theory in action. Let’s start!!!

Let’s pick up a playground problem from Kaggle. Invasive species can have damaging effects on the environment, the economy, and even human health. Consider tangles of kudzu that overwhelm trees in Georgia, while cane toads threaten habitats in over a dozen countries worldwide. This means it is very important to track and stop the spread of these invasive species. Think of how costly and difficult it would be to undertake this task at a large scale. Trained scientists would be required to visit designated areas and take note of the species inhabiting them. Using such a highly qualified workforce is expensive, time inefficient, and insufficient, since humans cannot cover large areas when sampling.

Looks like a very interesting use case for Deep Learning.

What we need is a labeled dataset of images marked as invasive or safe. Our algorithm will take care of the rest. You can start a kernel (Python Jupyter notebook) using this link and follow along. A few settings to keep in mind: make sure that you have GPU and internet enabled. There are several libraries in Python for deep learning; however, we will use fastai.

The full code is available here.

Let’s start coding!!!

# Get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

These are just some basic commands as good practice: autoreload reloads modules automatically before executing code, and %matplotlib inline is a magic command that renders plots directly inside the notebook.

### Import Required Libraries
# Using Fastai Libraries
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
import numpy as np
import pandas as pd
import torch
import os
PATH = "../input"
print(os.listdir(PATH))
TMP_PATH = "/tmp/tmp"
MODEL_PATH = "/tmp/model/"
sz= 224
bs = 58
arch = resnet34

Defining some variables:

  • PATH: location/path of the dataset.
  • sz: the size that the images will be resized to, in order to ensure that the training runs quickly.
  • bs: the batch size, that is, the number of images processed together when the data is broken up into smaller parts.
  • arch: the selected architecture of the neural network model.

I know in this series we have not yet covered how the convolution function, and in particular how CNNs, work. However, for now all we need to know is that a CNN is a type of neural network popular for image classification and Resnet is a type of architecture. Resnet-34 has 34 layers!

The programming framework used behind the scenes to work with NVidia GPUs is called CUDA. Further, to improve performance, we need to check for the NVidia package called CuDNN (special accelerated functions for deep learning).

### Checking GPU Set up
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)

Both of these should print True.

Now let’s look at what form the data is in, that is, we need to understand how the data directories are structured, what the labels are and what some sample images look like. The f'...' prefix marks a Python f-string, a convenient way to insert variables into a path or string.

files = os.listdir(f'{PATH}/train')[:5]
## train contains image names
print(files)
img = plt.imread(f'{PATH}/train/{files[0]}')
plt.imshow(img);
print(img.shape)
kaggle CNN Image

We get the height, width and number of channels using img.shape. img is a 3 dimensional array holding the Red, Green and Blue values of every pixel, so a slice such as img[:4,:4] gives the values of the top-left 4x4 pixels. The image above should give us an idea of the size of the image. Now, let’s split the data into a train and a validation set.
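
To see what these array operations return without loading a file, here is a small sketch on a synthetic image. The 6x8 size and the pure red colour are made-up values for illustration.

```python
import numpy as np

# Synthetic RGB image: 6 pixels high, 8 pixels wide, 3 colour channels.
img = np.zeros((6, 8, 3), dtype=np.uint8)
img[..., 0] = 255                      # fill the red channel, leave green and blue at 0

height, width, channels = img.shape    # same order as plt.imread returns for colour images
patch = img[:4, :4]                    # top-left 4x4 pixels, all three channels kept

print(height, width, channels)
print(patch.shape, patch[0, 0])
```

Each pixel is a vector of three values, which is why the slice keeps a third dimension of length 3.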

label_csv = f'{PATH}/train_labels.csv'
n = len(list(open(label_csv))) - 1 # header is not counted (-1)
val_idxs = get_cv_idxs(n) # random 20% data for validation set
print(n) #Total Data size
print(len(val_idxs)) #Validation dataset size
label_df = pd.read_csv(label_csv)
### Count of both classes
label_df.pivot_table(index="invasive", aggfunc=len).sort_values('name', ascending=False)

The label CSV contains the image name and the corresponding label (1 or 0), where 1 means the image is tagged as invasive.

Table 1: Target Variable Distribution

Label  Count
1      1448
0      847

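
The random 20% validation split can be sketched in plain NumPy. This is not fastai's implementation of get_cv_idxs, only an illustration of the idea; the function name and seed are assumptions, and the total comes from the class counts in Table 1.

```python
import numpy as np

def random_val_idxs(n, val_pct=0.2, seed=42):
    """Pick a random val_pct share of the row indices for the validation set."""
    rng = np.random.default_rng(seed)
    n_val = int(n * val_pct)
    return rng.permutation(n)[:n_val]

n_images = 1448 + 847                          # invasive + non-invasive images
val_idx_sketch = random_val_idxs(n_images)

train_mask = np.ones(n_images, dtype=bool)     # True for rows that stay in the training set
train_mask[val_idx_sketch] = False

print(n_images, len(val_idx_sketch), int(train_mask.sum()))
```

Fixing the seed keeps the same images in the validation set across runs, which makes results comparable between experiments.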
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv', test_name='test', 
                                   val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

tfms stands for transformations. tfms_from_model takes care of resizing, image cropping, initial normalization and more. transforms_side_on applies a pre-defined list of augmentation functions. We can also specify random zooming of images up to a specified scale by adding the max_zoom parameter.

With ImageClassifierData.from_csv we are just putting together everything (train, validation set, the labels and batch size).

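# Path of the first training image (the file names already include the 'train/' folder)
fn = f'{PATH}/' + data.trn_ds.fnames[0]
#img = PIL.Image.open(fn)
# Map each training file name to its (width, height) as reported by PIL
size_d = {k: PIL.Image.open(f'{PATH}/' + k).size for k in data.trn_ds.fnames}
row_sz, col_sz = list(zip(*size_d.values()))
row_sz = np.array(row_sz); col_sz = np.array(col_sz)
# Histogram of the first dimension of the image sizes
plt.hist(row_sz);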

A plot of the distribution of the size of the images. Ideally, we want all images to have a standard size to allow easier computation.

kaggle CNN Image

Our first model: To make the process quick we will first run a pre-trained model and observe the results. We can then tweak the model for improvements. A pre-trained model is a model created by someone else to solve a different problem; its weights have already been trained on their dataset and saved. We will try out their weights as is, that is, instead of learning our own weights from scratch on our dataset, we start from theirs. This is what we call transfer learning.

Is that a good idea?

Well, these weights are usually obtained by training on a very large dataset, for example ImageNet. Reusing them helps speed up your training process.

We have a train set with 1836 images and a test set with 1531, which is not much for training a high accuracy model from scratch. Further, in the article regarding the black box we observed how gradients and edges are found in the initial layers of a neural network. That is useful information for our use case as well.

Let us form a function to get the data and resize images if necessary.

def get_data(sz, bs): # sz: image size, bs: batch size
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv', test_name='test',
                                       val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
    
    return data if sz > 500 else data.resize(512,TMP_PATH) 
# Reading the jpgs and resizing is slow for big images, so resizing them all to standard size first saves time
data = get_data(sz, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True,tmp_name=TMP_PATH, models_name=MODEL_PATH)
learn.fit(1e-2, 3)

ConvLearner.pretrained builds a learner that contains a pre-trained model. The last layer of the model needs to be replaced with a layer of the right dimensions. The pre-trained model was trained for 1000 classes, therefore its final layer predicts a vector of 1000 probabilities. However, we only need a two-dimensional output (invasive or not). The diagram below shows an example of how this was done in one of the earliest successful CNNs. The layer “FC8” here would get replaced with a new layer with 2 outputs.
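
The effect of swapping the last layer can be illustrated with plain NumPy. This is a shape-level sketch only, not what fastai does internally: the feature size of 512, the random weights and the batch of 4 images are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 512                                   # assumed size of the vector entering the last layer
features = rng.normal(size=(4, n_features))        # stand-in for frozen-layer outputs of 4 images

old_head = rng.normal(0.0, 0.01, size=(n_features, 1000))   # original last layer: 1000 classes
new_head = rng.normal(0.0, 0.01, size=(n_features, 2))      # replacement last layer: 2 classes

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # subtract the row maximum for numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

old_log_probs = log_softmax(features @ old_head)   # one log-probability per original class
new_log_probs = log_softmax(features @ new_head)   # one log-probability per class of our problem

print(old_log_probs.shape, new_log_probs.shape)
```

Only the new head starts from random weights; everything that produces `features` keeps its pre-trained weights.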

Parameters are learned by fitting a model to the data. Hyperparameters are another kind of parameter, that cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. In learn.fit we provide the learning rate and the number of epochs (times we pass over the complete dataset).
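
As a quick worked example of what an epoch means here: with the 1836 training images and the batch size of 58 used in this post, the number of weight updates can be computed as follows. The sketch assumes the last, smaller batch is kept rather than dropped.

```python
import math

n_train = 1836        # training images
batch_size = 58       # bs in this post
n_epochs = 3          # second argument of learn.fit(1e-2, 3)

iterations_per_epoch = math.ceil(n_train / batch_size)   # mini-batches needed to see every image once
total_updates = iterations_per_epoch * n_epochs

print(iterations_per_epoch, total_updates)
```

The learning rate controls the size of each of these updates, and the number of epochs controls how many of them there are.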

The output of learn.fit is:

Table 2: Loss/Accuracy By Epoch

epoch  trn_loss  val_loss  accuracy
0      0.379021  0.196531  0.932462
1      0.285149  0.168239  0.947712
2      0.229199  0.14343   0.947712

94% accuracy on our first model!!!

Error Analysis

Let’s write some functions to try and understand what the model is getting right and wrong. We will explore:

  • A few correct labels at random
  • A few incorrect labels at random
  • The most correct labels of each class (i.e. those with highest probability that are correct)
  • The most incorrect labels of each class (i.e. those with highest probability that are incorrect)
  • The most uncertain labels (i.e. those with probability closest to 0.5).
# this gives prediction for validation set. Predictions are in log scale
log_preds = learn.predict()
print(log_preds.shape)
preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
probs = np.exp(log_preds[:,1])        # pr(1) # Where Species = Invasive is class 1


def rand_by_mask(mask):
    idxs = np.where(mask)[0]
    # Sample at most 4 indices, and never more than the mask actually selects
    return np.random.choice(idxs, min(len(idxs), 4), replace=False)
def rand_by_correct(is_correct): return rand_by_mask((preds == data.val_y)==is_correct)
def plots(ims, figsize=(12,6), rows=1, titles=None):
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        if titles is not None: sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i])
def load_img_id(ds, idx): return np.array(PIL.Image.open(f'{PATH}/'+ds.fnames[idx]))

def plot_val_with_title(idxs, title):
    imgs = [load_img_id(data.val_ds,x) for x in idxs]
    title_probs = [probs[x] for x in idxs]
    print(title)
    return plots(imgs, rows=1, titles=title_probs, figsize=(16,8)) if len(imgs)>0 else print('Not Found.')

def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]

def most_by_correct(y, is_correct): 
    mult = -1 if (y==1)==is_correct else 1
    return most_by_mask(((preds == data.val_y)==is_correct) & (data.val_y == y), mult)

Let’s take a look at what we get if we were to call these functions. Keep in mind our classification threshold is 0.5.

# 1. A few correct labels at random
plot_val_with_title(rand_by_correct(True), "Correctly classified")
kaggle CNN Image
# 2. A few incorrect labels at random
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")
kaggle CNN Image
# Most correct classifications: Class 0
plot_val_with_title(most_by_correct(0, True), "Most correct classifications: Class 0")
kaggle CNN Image
# Most correct classifications: Class 1
plot_val_with_title(most_by_correct(1, True), "Most correct classifications: Class 1")
kaggle CNN Image
# Most incorrect classifications: Actual Class 0 Predicted Class 1
plot_val_with_title(most_by_correct(0, False), "Most incorrect classifications: Actual Class 0 Predicted Class 1")
kaggle CNN Image
# Most incorrect classifications: Actual Class 1 Predicted Class 0
plot_val_with_title(most_by_correct(1, False), "Most incorrect classifications: Actual Class 1 Predicted Class 0")
kaggle CNN Image
# Most uncertain predictions
most_uncertain = np.argsort(np.abs(probs -0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")
kaggle CNN Image
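
The selection logic used above can be checked on a handful of synthetic predictions. The probabilities and labels below are made up, and `top_k` is a simplified, hypothetical stand-in for most_by_mask.

```python
import numpy as np

# Log-probabilities for 6 validation images: column 0 = not invasive, column 1 = invasive.
log_preds = np.log(np.array([[0.90, 0.10], [0.20, 0.80], [0.45, 0.55],
                             [0.70, 0.30], [0.05, 0.95], [0.60, 0.40]]))
y_true = np.array([0, 1, 0, 1, 1, 0])

preds = np.argmax(log_preds, axis=1)      # predicted class per image
probs = np.exp(log_preds[:, 1])           # probability of class 1 (invasive)
correct = preds == y_true

def top_k(mask, descending, k=2):
    """Indices selected by mask, sorted by probability of class 1."""
    idxs = np.where(mask)[0]
    order = np.argsort(-probs[idxs] if descending else probs[idxs])
    return idxs[order[:k]]

most_correct_invasive = top_k(correct & (y_true == 1), descending=True)
most_incorrect_invasive = top_k(~correct & (y_true == 1), descending=False)
most_uncertain = np.argsort(np.abs(probs - 0.5))[:2]

print(most_correct_invasive, most_incorrect_invasive, most_uncertain)
```

The most confidently correct invasive images come first, the missed invasive image is the one with the lowest invasive probability, and the uncertain ones sit closest to the 0.5 threshold.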

Scope of Improvement:

  • Find an Optimal Learning Rate
  • Use Data Augmentation techniques
  • Instead of using the pre-trained weights as is, train more layers of the neural network on our dataset
## How does loss change with changes in Learning Rate (For the Last Layer)
learn.lr_find()
learn.sched.plot_lr()
kaggle CNN Image

The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing.

# Note that the loss still clearly improves until lr=1e-2 (0.01).
# The LR can vary over time as a part of the stochastic gradient descent.
learn.sched.plot()

We can see the plot of loss versus learning rate to see where our loss stops decreasing:

kaggle CNN Image
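
The idea behind the learning rate finder can be reproduced on a toy problem. The sketch below is not fastai's lr_find: it fits a small linear regression with mini-batch gradient descent, multiplies the learning rate at every step, and stops once the loss blows up. The data, the range of learning rates and the stopping rule (loss above 4 times the best loss seen) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=512)        # noisy targets of a toy regression

def lr_range_test(start_lr=1e-5, end_lr=10.0, n_steps=100, batch_size=32):
    w = np.zeros(5)
    history = []                                    # (learning rate, mini-batch loss) pairs
    best_loss = np.inf
    for lr in np.geomspace(start_lr, end_lr, n_steps):
        batch = rng.choice(len(X), size=batch_size, replace=False)
        err = X[batch] @ w - y[batch]
        loss = float(np.mean(err ** 2))
        if loss > 4 * best_loss:                    # loss has stopped decreasing and blown up
            break
        best_loss = min(best_loss, loss)
        history.append((float(lr), loss))
        grad = 2 * X[batch].T @ err / batch_size    # gradient of the mean squared error
        w -= lr * grad                              # one SGD step with the current learning rate
    return history

history = lr_range_test()
lrs, losses = zip(*history)
lr_at_min_loss = lrs[int(np.argmin(losses))]
print(len(history), lr_at_min_loss)
```

A reasonable choice is usually somewhat below `lr_at_min_loss`, in the region where the loss is still clearly decreasing.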

Now we have an idea of how to select our learning rate. To set the number of epochs, we just need to ensure that there is no over-fitting. Let’s talk about data augmentation.

Data augmentation is a good step to prevent over-fitting. That is, by cropping/zooming/rotating the image, we can ensure that the model does not learn patterns specific to the train data and generalizes well to new data.

def get_augs():
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                        bs = 2, tfms=tfms,
                    suffix='.jpg', val_idxs=val_idxs, test_name='test')
    x,_ = next(iter(data.aug_dl))
    return data.trn_ds.denorm(x)[1]
    
# An Example of data augmentation
ims = np.stack([get_augs() for i in range(6)])
plots(ims, rows=2)
kaggle CNN Image
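
To make the idea of augmentation concrete without fastai, here is a minimal NumPy sketch of a random horizontal flip followed by a random zoom implemented as a crop. It only imitates the spirit of transforms_side_on with max_zoom=1.1, it is not that implementation, and unlike a real pipeline it does not resize the crop back to the original size.

```python
import numpy as np

def augment(img, rng, max_zoom=1.1):
    """Randomly flip the image left-right and crop a slightly zoomed-in window."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                          # horizontal flip
    zoom = rng.uniform(1.0, max_zoom)
    h, w = img.shape[:2]
    crop_h, crop_w = int(h / zoom), int(w / zoom)   # zooming in means keeping a smaller window
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                   # stand-in for a 224x224 RGB image
augmented = [augment(image, rng) for _ in range(6)]
print([a.shape for a in augmented])
```

Each call returns a slightly different view of the same image, so the model rarely sees exactly the same pixels twice.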

With precompute=True, the activations of the frozen layers are computed once and reused, so only the weights in the last layer are updated with our dataset and data augmentation has no effect. Now we will train the model with precompute set to False, so that the augmented images are actually used, and with cycle_len enabled. cycle_len turns on a technique called stochastic gradient descent with restarts (SGDR), a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. In other words, SGDR reduces the learning rate every mini-batch, and resets it every cycle_len epochs. The annealing is helpful because as we get closer to the optimal weights, we want to take smaller steps.

learn.precompute=False
learn.fit(1e-2, 3, cycle_len=1)

Table 3: Loss/Accuracy By Epoch

epoch  trn_loss  val_loss  accuracy
0      0.221001  0.1623    0.943355
1      0.232999  0.179043  0.941176
2      0.224435  0.148815  0.947712

Calling learn.sched.plot_lr() once again:

kaggle CNN Image
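
The shape of such a schedule can be sketched in a few lines. The code assumes cosine annealing within each cycle and 32 mini-batches per epoch (1836 images with a batch size of 58); the function name is hypothetical and this is not fastai's scheduler.

```python
import numpy as np

def sgdr_schedule(max_lr, iterations_per_epoch, n_cycles, cycle_len=1):
    """Learning rate per mini-batch: cosine decay within a cycle, reset at each new cycle."""
    cycle_iters = iterations_per_epoch * cycle_len
    progress = np.arange(cycle_iters) / cycle_iters          # 0 at cycle start, close to 1 at its end
    one_cycle = max_lr * 0.5 * (1 + np.cos(np.pi * progress))
    return np.tile(one_cycle, n_cycles)

lrs = sgdr_schedule(max_lr=1e-2, iterations_per_epoch=32, n_cycles=3, cycle_len=1)
print(len(lrs), lrs[0], lrs[31], lrs[32])
```

The learning rate falls within each cycle and jumps back to its maximum at the first mini-batch of the next one, which gives the saw-tooth shape of the plot.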

To unfreeze the remaining layers, we call unfreeze. We will also try differential learning rates for the respective layers.

learn.unfreeze()
lr=np.array([1e-4,1e-3,1e-2])
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

Table 4: Loss/Accuracy By Epoch

epoch  trn_loss  val_loss  accuracy
0      0.323539  0.178492  0.923747
1      0.247502  0.132352  0.949891
2      0.192528  0.128903  0.954248
3      0.165231  0.101978  0.962963
4      0.141049  0.106319  0.960784
5      0.121947  0.103018  0.960784
6      0.107445  0.100944  0.965142

Improved our model, 96.5% accuracy…

kaggle CNN Image

Above, we have the learning rate of the final layers. The learning rates of the earlier layers are fixed at the same multiples of the final layer rates as we initially requested (i.e. the first layers have 100x smaller, and middle layers 10x smaller learning rates, since we set lr=np.array([1e-4,1e-3,1e-2])).
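
A short calculation shows why learn.fit(lr, 3, cycle_len=1, cycle_mult=2) reports 7 epochs, and how the three learning rates relate to each other. The sketch assumes each cycle is cycle_mult times longer than the previous one; the helper name and the group labels are hypothetical.

```python
def epochs_per_cycle(n_cycles, cycle_len, cycle_mult):
    """Length of each cycle in epochs when every cycle is cycle_mult times the previous one."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

cycles = epochs_per_cycle(n_cycles=3, cycle_len=1, cycle_mult=2)
total_epochs = sum(cycles)

# Differential learning rates, one per group of layers (labels are only for illustration).
layer_group_lrs = {"early": 1e-4, "middle": 1e-3, "final": 1e-2}
ratios = {name: round(layer_group_lrs["final"] / lr) for name, lr in layer_group_lrs.items()}

print(cycles, total_epochs, ratios)
```

The total matches the epochs 0 to 6 listed in Table 4.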

To get a better picture, we can use test time augmentation (learn.TTA()): we apply data augmentation to the validation images as well and make predictions on both the original images and their augmented versions. Averaging these predictions usually gives a more reliable estimate.

Our confusion matrix:

kaggle CNN Image

Our final accuracy was 96.73% and upon submission to the public leader-board we got 98%.
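
The averaging step of test time augmentation and the confusion matrix can both be sketched with NumPy. The probabilities below are made up (3 augmented versions of 3 images), and averaging probabilities is an assumption about how the versions are combined.

```python
import numpy as np

# Log-probabilities with shape (augmented versions, images, classes).
log_preds_tta = np.log(np.array([
    [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]],
    [[0.70, 0.30], [0.40, 0.60], [0.40, 0.60]],
    [[0.80, 0.20], [0.20, 0.80], [0.35, 0.65]],
]))
y_true = np.array([0, 1, 1])

single_preds = np.exp(log_preds_tta[0]).argmax(axis=1)   # predictions from one version only
mean_probs = np.exp(log_preds_tta).mean(axis=0)          # average the probabilities over versions
tta_preds = mean_probs.argmax(axis=1)

# Confusion matrix: rows are the true class, columns the predicted class.
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (y_true, tta_preds), 1)

print(single_preds, tta_preds)
print(confusion)
```

In this toy case the third image is misclassified from a single version but classified correctly once the versions are averaged.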

Code Summary and Explanation Steps

Data Exploration:

  • Explore the data size and get an idea of what the images look like.
  • Check the distribution of image sizes. Resizing of Images (Standardizing) might be required to speed up the process.

Models Tweaking:

  • Run a quick model (small number of epochs) with precompute=True, that is, only updating the weights of the last layer.
  • Evaluate the Performance by observing the train and validation loss and the overall accuracy.
  • Explore the Images of the most correct/incorrect classifications to understand if there are any visible patterns/reasons of wrong classification. It helps to get more comfortable with what the model is doing.
  • Find optimal Learning Rate using lr_find(). We want a learning rate where loss is improving.
  • Train last layer from precomputed activations for 1-2 epochs.
  • Use data augmentation and train the last layer again (cycle_len = 1).
  • Unfreeze layers and retrain the model. Set the earlier layers to 3x-10x lower learning rate than next higher layer.
  • Recheck the Learning Rate (lr_find).
  • Train full network with cycle_mult=2 until over-fitting.
  • Use Test time augmentation to get a better picture regarding the accuracy.


]]>
https://www.datasciencediscovery.com/index.php/2018/11/25/deep-learning-invasive-species/feed/ 0 554
Deep Learning Black box https://www.datasciencediscovery.com/index.php/2018/11/22/deep-learning-black-box/?utm_source=rss&utm_medium=rss&utm_campaign=deep-learning-black-box https://www.datasciencediscovery.com/index.php/2018/11/22/deep-learning-black-box/#respond Thu, 22 Nov 2018 08:55:46 +0000 http://datasciencediscovery.com/?p=529 Let’s try to take a sneak peak inside the black box of deep learning and try to build some intuition along the way. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired […]]]>

Let’s try to take a sneak peek inside the black box of deep learning and try to build some intuition along the way.

Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources and written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.

Inside the Black box

We have covered the origins and understood a little bit about the structure of neural networks in the previous articles. However, before we further dive into the math behind the working of neural networks, we need to polish our understanding of what is going on inside the black box.

Deep learning algorithms are mostly a black box. We do not know what patterns are being observed that trigger an activation function. We can make a guess: for example, when a model classifies a “Dog” in the Cats vs Dogs dataset, it probably saw the ears or the shape of the dog’s face. But this uncertainty would not work when these algorithms are being used in self-driving cars. In such use cases we need to know why the algorithm is working the way it is.

Feature Visualization

In neural networks, a neuron does not necessarily fire for every image. That is, a neuron will be activated only by a select set of features that are present in the input images.

Well, some light was shed on feature visualization by Matthew D. Zeiler and Rob Fergus in a research paper that is available here.

They put together a novel method to decode these features. First, they trained a normal CNN (convolutional neural network, a type of neural network) to classify images. At the same time, they also trained a backward looking network.

Basic CNN

To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Let’s break it down.

Suppose we pass image A into our CNN, it passes through layer K, and only neuron M ends up being activated. Now, the backward looking network (Deconvnet) will be used to reconstruct the state in the previous step. That is, based on the output of layer K, we set all activations other than M to 0, and then we try to reverse whatever activity in the layer beneath led to the activation of neuron M.
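
The "set all other activations to zero" step can be shown on a small array. This sketch covers only the masking, not the deconvnet itself; the layer shape (8 feature maps of 5x5) and the choice to keep the single strongest activation of one map are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random((8, 5, 5))     # stand-in for the output of layer K: 8 feature maps of 5x5

m = 3                                   # index of the feature map (neuron M) we want to examine
row, col = np.unravel_index(np.argmax(activations[m]), activations[m].shape)

masked = np.zeros_like(activations)     # every other activation in the layer is set to zero
masked[m, row, col] = activations[m, row, col]

print(int(np.count_nonzero(masked)), (int(row), int(col)))
```

The masked array is what would be passed backwards to reconstruct the input pattern responsible for that one activation.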

Thus, our goal here is to understand what feature activates a neuron. Let’s say we are training our network on cats vs dogs dataset, now we start focusing on a single neuron and ignoring all other neurons with the purpose of understanding what activated that neuron. Maybe the dog’s ears are the defining feature for this neuron and that is what this neuron looks for in every image. Now using this methodology we can explore how things are working layer by layer. In the images below you will notice the actual images and what the neural network is observing:

Layers CNN
Layers CNN
Layers CNN
Layers CNN

Explanation:

In layer 1, the CNN is able to identify color gradients, and as we look at deeper layers, more complex patterns appear. The patterns progress from gradients to edges/shapes to complex features like eyes.

It was necessary to supply several images to the network to see what activates the selected neuron. This becomes computationally intensive. There is another way: what if we supply an image created with random pixels and try to find out what would excite one particular neuron? We use an image similar to the one shared below, and run it through the neural network while focusing on a single neuron. The procedure then works out how to change the color of each pixel to increase the activation of that neuron. More information is available here in the paper by Jason Yosinski.

Random Pixels CNN

So what do these activated images look like?

Random Pixels Activate CNN
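To make the random-pixel idea concrete, here is a minimal sketch in plain Python. It is not the method from the paper: the "network" is a single hypothetical sigmoid neuron with made-up weights, and the "image" is just four pixel intensities. It only illustrates the core loop, which is to start from noise and nudge each pixel in the direction that increases the chosen neuron's activation.

```python
import math
import random

def neuron_activation(pixels, weights, bias):
    """Toy 'neuron': sigmoid of a weighted sum of pixel values."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def maximize_activation(weights, bias, steps=200, lr=0.5, seed=0):
    """Gradient ascent on the input pixels, starting from random noise."""
    rng = random.Random(seed)
    pixels = [rng.random() for _ in weights]  # random image in [0, 1]
    start = neuron_activation(pixels, weights, bias)
    for _ in range(steps):
        a = neuron_activation(pixels, weights, bias)
        # For a sigmoid neuron: d(activation)/d(pixel_i) = a * (1 - a) * w_i
        grads = [a * (1.0 - a) * w for w in weights]
        # Move each pixel uphill, then clip it back to a valid intensity.
        pixels = [min(1.0, max(0.0, p + lr * g)) for p, g in zip(pixels, grads)]
    return pixels, start, neuron_activation(pixels, weights, bias)

# Hypothetical weights, chosen only for illustration.
weights = [0.8, -0.5, 0.3, -0.9]
pixels, before, after = maximize_activation(weights, bias=0.0)
print("activation before:", before)
print("activation after: ", after)
print("image that excites the neuron:", pixels)
```

Pixels attached to positive weights are pushed to full intensity and pixels attached to negative weights are pushed to zero, which is the toy equivalent of the patterns emerging in the images above.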

Hope this gives you a sneak peek into how neural networks work, especially with image data. If you wish to explore this further, please have a look at this amazing blog post on Distill.

What about structured Data?

Let’s say you are working on predicting the future sales of a retail store. A neuron might get activated by, or give higher weights to, certain inputs, for example the variables item category and season of sale. For simplicity, try to think of this as the weights given in a linear regression. Why would these two variables cause the needle to shift?

Well, maybe there are some seasonal products in the data set which activate that particular neuron. Similarly we can develop some intuition of which variables are influencing our neurons.

What’s Next?

Mechanics of Deep Learning:

Let’s dive into the core of neural networks. Understand the concepts of Gradient Descent and Back-propagation to get some idea of how Neural Networks learn. Warning: some math involved! Don’t worry, we will first try to explain it in an intuitive manner and then explore some of the math behind it.

]]>
https://www.datasciencediscovery.com/index.php/2018/11/22/deep-learning-black-box/feed/ 0 529
Deep Learning Activation Function https://www.datasciencediscovery.com/index.php/2018/11/21/deep-learning-activation-function/?utm_source=rss&utm_medium=rss&utm_campaign=deep-learning-activation-function https://www.datasciencediscovery.com/index.php/2018/11/21/deep-learning-activation-function/#respond Wed, 21 Nov 2018 16:33:34 +0000 http://datasciencediscovery.com/?p=524 Understand how an activation function behaves for different neurons and connect it to the grand architecture. The concept of different type of activation functions explored in detail. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This […]]]>

Understand how an activation function behaves for different neurons and connect it to the grand architecture. Different types of activation functions are explored in detail.

Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series is inspired by, and references, several sources, and is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.

Activation Function

We briefly discussed activation functions in the blog post on the architecture of neural networks. But let’s improve our understanding by diving further into this topic.

Activation functions are essentially the deciding authority on whether the information provided by a neuron is relevant or can be ignored. Drawing a parallel to our brain, there are many neurons, but not all of them are activated by a given input stimulus. Thus, there must be some mechanism that decides which neurons are triggered by a particular stimulus. Let’s put this in perspective:

Basic NN

The output signal is passed on only if the neuron is activated. Consider neuron A, which computes the weighted sum of its inputs plus a bias term.

Math NN

Thus, we are simply doing some linear matrix transformations, and as mentioned in the deep learning architecture blog, just doing a linear operation is not strong enough. We need to add some non-linear transformations, and that is where activation functions come into the picture.

Also, the range of this weighted sum is -inf to inf. As the output of a neural network, this range does not make sense. For example, if we are classifying images as Dogs or Cats, what we need is a binary value or a probability, so we need a mapping to a smaller space. The output space could be 0 to 1, -1 to 1, and so on, depending on the choice of function.
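As a small illustration of the weighted sum described above, here is a sketch in plain Python. The weights and bias are hypothetical values picked for the example; the point is only that the raw output of a neuron can be arbitrarily large in either direction.

```python
def weighted_sum(inputs, weights, bias):
    """Pre-activation of one neuron: z = w1*x1 + ... + wn*xn + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical weights and bias for a neuron with three inputs.
weights = [0.5, -1.0, 2.0]
bias = 1.0

print(weighted_sum([1.0, 2.0, 3.0], weights, bias))    # 0.5 - 2 + 6 + 1 = 5.5
print(weighted_sum([100.0, 0.0, 0.0], weights, bias))  # 51.0
print(weighted_sum([0.0, 100.0, 0.0], weights, bias))  # -99.0
```

Nothing bounds these values, which is why we need a function on top to map them into a smaller space.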

So to summarize we need the activation functions to introduce non-linearities, get better predictions and reduce the output space.

Now, let’s do a simple exercise: given this idea of neuron activation, how would you come up with an activation function? What we want is a binary value indicating whether a neuron is activated or not.

Types of Activation Functions

Step Function

First thing that comes to mind is defining a threshold. If the value is beyond a certain threshold declare it as activated. Now, if we are defining this function for the space 0 to 1, we can easily say for any value above 0.5 consider the neuron activated.

Wow! We have our first activation function. What we have defined here is a Step Function also known as Unit or Binary Step function.

Activation NN
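The threshold rule can be sketched in a couple of lines of Python. The 0.5 threshold follows the text; treating a value exactly at the threshold as "not activated" is an assumption made here for the example.

```python
def step(z, threshold=0.5):
    """Binary step: 1 if the value is above the threshold, else 0."""
    return 1 if z > threshold else 0

for z in [0.1, 0.49, 0.5, 0.51, 0.9]:
    print(z, "->", step(z))
```

Notice that the output jumps straight from 0 to 1 and is flat everywhere else, which matters for the disadvantages listed below.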

Advantage

  • If we are creating a binary classifier, it makes sense as we eventually require a value of 0 or 1 so having such a function at the final layer will be cool.
  • Extremely simple.

Disadvantage

  • Impractical, as in most use cases your data has multiple classes to deal with.
  • The gradient of the step function is zero everywhere except at the threshold, where it is undefined, which makes it useless for training. During back-propagation the gradient of the activation function is used to work out how to adjust the weights to reduce the error, so a gradient of zero means the model never improves.

Thus, because of how back-propagation works, we want the activation function to be differentiable, with a gradient that is not zero everywhere.

Now let’s take a look at the larger picture. There are multiple neurons in our neural network. In the blog post on the intuitive understanding of neural networks, we discussed how neural networks look for patterns in images. If you haven’t read it, all you need to know is that different neurons might select or identify different patterns. Revisiting the Dog vs Cat image identification example, what happens if multiple neurons are activated?

Linear Function

With the use case defined above, let’s try a linear function, as we have figured out that a binary function didn’t help much: f(X) = CX, a straight-line function where the activation is proportional to the weighted sum from the neuron. If more than one neuron gets activated, we can take the maximum of the activation values; that way we have only one neuron to be concerned about.

Oh wait! The derivative of this function is a constant value: f’(X) = C. What does that mean?

Well, this means that every time we do back-propagation, the gradient of the activation is the same constant C no matter what the input was, so it carries no information about how far off the neuron was. Also, with each layer being a linear transformation, the final output is just a linear transformation of the input, however many layers we stack. Further, the output space of (-inf, inf) is unbounded, which is awkward to work with. Hence, not desirable.
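The point about stacked linear layers can be checked with a tiny sketch. The weights, biases and the constant C = 0.5 below are hypothetical numbers chosen for the example; each "layer" has a single input and a single output to keep the arithmetic easy to follow.

```python
def linear_layer(x, w, b, c=1.0):
    """One 'layer' with a linear activation f(z) = c * z, where z = w*x + b."""
    return c * (w * x + b)

def two_layers(x):
    """Two stacked layers, both using the linear activation."""
    hidden = linear_layer(x, w=2.0, b=1.0, c=0.5)
    return linear_layer(hidden, w=3.0, b=-1.0, c=0.5)

def single_equivalent(x):
    """The same mapping written as one straight line.

    hidden = 0.5 * (2x + 1) = x + 0.5
    output = 0.5 * (3 * (x + 0.5) - 1) = 1.5x + 0.25
    """
    return 1.5 * x + 0.25

for x in [-2.0, 0.0, 3.5]:
    print(x, two_layers(x), single_equivalent(x))
```

Both functions print the same values, so the second layer added nothing that a single layer could not already do.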

Sigmoid Function

Let’s pull out the big guns. The sigmoid function is a smoother version of the step function. It is non-linear and can be used for non-binary activations. It is also continuously differentiable.

Activation Sigmoid NN

The curve is steepest for values of Z between -2 and 2. In this region, even small changes in Z result in large changes in the value of the function. This pushes outputs towards the extreme parts of the curve, making clear distinctions in the prediction. Another advantage of sigmoid activation is that the output lies between 0 and 1, making it an ideal function for use cases where a probability is required.

That all sounds good! Then what’s the issue? Beyond +3 and -3 the curve gets pretty flat. This means that the gradient at such points is very small. Thus, the improvement in error becomes almost zero at these points and the network learns slowly. This is known as the vanishing gradient problem. There are some ways to take care of this issue. Other issues are the computational load of the exponential and the output not being zero-centered.
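To see the flattening in numbers, here is a minimal sketch assuming the standard sigmoid formula, 1 / (1 + exp(-z)), whose derivative is sigmoid(z) * (1 - sigmoid(z)). The sample points are arbitrary.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-6.0, -3.0, 0.0, 3.0, 6.0]:
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  gradient={sigmoid_grad(z):.4f}")
```

The gradient peaks at 0.25 when z = 0 and shrinks quickly as z moves away from the center, which is the vanishing gradient behavior described above.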

Tanh Function

Hyperbolic tangent activation function is very similar to the sigmoid function.

Activation tanh NN

Compare the formula of the tanh function with sigmoid: tanh(x) = 2 sigmoid(2x) - 1

To put it in words, if we scale and shift the sigmoid function we get the tanh function. Thus, it has similar properties to the sigmoid function. The tanh function also suffers from the vanishing gradient problem and therefore kills gradients when saturated. Unlike sigmoid, tanh outputs are zero-centered since the range is between -1 and 1.
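The relationship between the two functions is easy to verify numerically. This sketch assumes the standard sigmoid, 1 / (1 + exp(-z)), and compares the rescaled version against Python's built-in math.tanh at a few arbitrary points.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(x):
    """tanh written as a scaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(f"x={x:+.1f}  math.tanh={math.tanh(x):+.6f}  via sigmoid={tanh_via_sigmoid(x):+.6f}")
```

The two columns agree up to floating-point rounding, and the outputs are symmetric around zero.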

ReLU Function

To truly address the problem of vanishing gradients we need to talk about the rectified linear unit (ReLU) and Leaky ReLU. ReLU is one of the most popular activation functions for the hidden layers of deep neural networks.

Activation relu NN
g(z) = max(0, z)

Activation relu derivative NN

When the input z < 0 the output is 0, and when z > 0 the output is z. As you can see, a derivative exists for all values except 0: the left derivative is 0 while the right derivative is 1. That’s a new issue: how will it work with gradient descent? In practice the input is rarely exactly 0; it is far more likely to be a value close to zero that was rounded to zero, so this is rarely a problem. Software implementations of neural network training usually return one of the one-sided derivatives instead of raising an error. ReLU is computationally very efficient, but it is not a zero-centered function. Another issue is that if z < 0 during the forward pass, the neuron is inactive and its gradient is zero during the backward pass. If this happens for every input, the weights never get updated and the neuron stops learning. This is known as the dying ReLU problem.

Leaky ReLU

This is a modification of the ReLU activation function. The idea of leaky ReLU is that when x < 0, the function has a small positive slope (0.1 here) instead of being flat. This eliminates the dying ReLU problem, but the results achieved with it are not consistent. It still has all the characteristics of a ReLU activation function: it is computationally efficient, converges quickly, and does not saturate in the positive region.
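Here is a minimal sketch of ReLU and its derivative in plain Python. Returning the left derivative (0) at exactly zero is an assumption made for this example; as noted above, either one-sided value is a reasonable convention.

```python
def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; at z == 0 this sketch returns the left derivative, 0."""
    return 1.0 if z > 0 else 0.0

for z in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    print(f"z={z:+.1f}  relu={relu(z):.1f}  gradient={relu_grad(z):.1f}")
```

For every negative input both the output and the gradient are zero, so no learning signal flows back through the neuron for that input.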

Activation leaky relu NN
f(x) = max(0.1*x,x)
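The formula translates directly into code. This sketch uses the slope of 0.1 from the formula above as a default parameter; the parameter name slope is just a label chosen for the example.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: max(slope * x, x), valid for 0 < slope < 1."""
    return max(slope * x, x)

def leaky_relu_grad(x, slope=0.1):
    """Derivative: 1 for positive inputs, the small slope otherwise."""
    return 1.0 if x > 0 else slope

for x in [-2.0, -0.5, 0.5, 2.0]:
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.2f}  gradient={leaky_relu_grad(x):.1f}")
```

Unlike plain ReLU, the gradient for negative inputs is small but not zero, so the weights can still be updated.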

There are many other generalizations and variations of ReLU, such as Parametric ReLU (PReLU), where the slope for negative inputs is learned during training.

Softmax Function

To tie things together, let’s discuss one last function. It is often used in the final layer of neural networks. The softmax function can be seen as a generalization of the sigmoid function to more than two classes, and it is often used for multi-class classification problems.

Activation Softmax NN

Look at the numerator: as we are taking the exponential of Zj, it is always a positive value. Further, even small changes in Zj result in large changes in that value (exponential scale). The denominator is the sum of the exponentials over all classes, which means the resulting probabilities add up to 1.

This makes it perfect for classifying an input into one of multiple classes. For example, given satellite images of a landscape, each image could be labeled as water, rain forest, land and so on.
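A short sketch of softmax in plain Python follows. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in the numerator and denominator. The scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Shift by the max score so exp() never overflows (stability trick).
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)
print("sum:", sum(probs))
```

Every output is positive, the outputs add up to 1, and the class with the highest raw score keeps the highest probability.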

Summary

Many activation functions and their characteristics have been discussed here. But the final question is when to use what?

There is no exact rule; rather, the choice is determined by the nature of your problem. Keep the characteristics of the activation functions in mind and choose the one that suits your use case and provides faster convergence. The most common practice is to use ReLU for the hidden layers and sigmoid (binary classification, for example Cat vs Dog) or softmax (multi-class classification) in the final layer.

]]>
https://www.datasciencediscovery.com/index.php/2018/11/21/deep-learning-activation-function/feed/ 0 524