Understand how an activation function decides whether a neuron fires, connect it to the larger architecture, and explore the different types of activation functions in detail.

### Deep Learning Series

In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired by and references several sources, and is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.

## Activation Function

We briefly discussed activation functions in the blog on the architecture of neural networks. But let's improve our understanding by diving into this topic further.

Activation functions are essentially the deciding authority on whether the information provided by a neuron is relevant or can be ignored. Drawing a parallel to our brain: there are many neurons, but not all of them are activated by a given input stimulus. Thus, there must be some mechanism that decides which neurons are triggered by a particular stimulus. Let's put this in perspective:

The output signal will be attained only if the neuron is activated. Consider a neuron A that computes the weighted sum of its inputs along with a bias term.
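As a small illustration, neuron A's pre-activation value is just a weighted sum plus a bias (the weights, inputs, and bias below are arbitrary example values):

```python
# Pre-activation of a single neuron: z = w . x + b
# (weights, inputs, and bias are made-up example values)
weights = [0.4, -0.2, 0.7]
inputs = [1.0, 2.0, 0.5]
bias = 0.1

z = sum(w * x for w, x in zip(weights, inputs)) + bias
print(z)  # -> approximately 0.45
```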

Thus, we are simply doing some linear matrix transformations and as mentioned in the deep learning architecture blog, just doing a linear operation is not strong enough. We need to add some **Non-Linear Transformations**, that is where Activation functions come into the picture.

Also, the range of this weighted sum is (-inf, inf). When we get an output from the neural network, this range does not make sense. For example, if we are classifying images as Dogs or Cats, what we need is a binary value or some probability, so we need a mapping to a smaller space. The output space could be between 0 and 1, or -1 and 1, and so on, depending on the choice of function.

So to summarize: we need activation functions to introduce non-linearities, get better predictions, and reduce the output space.

Now, let’s do a simple exercise: given this idea regarding the activation of neurons, how would you come up with an activation function? What we want is a binary value suggesting whether a neuron is activated or not.

## Types of Activation Functions

### Step Function

The first thing that comes to mind is defining a threshold: if the value is beyond a certain threshold, declare the neuron activated. If we are defining this function on the space 0 to 1, we can say that for any value above 0.5 the neuron is considered activated.

Wow! We have our first activation function. What we have defined here is a **Step Function** also known as **Unit or Binary Step function**.
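A minimal sketch of the step function described above, using the 0.5 threshold from the example:

```python
def step(z, threshold=0.5):
    """Binary/unit step: 1 if z is above the threshold, else 0."""
    return 1 if z > threshold else 0

print(step(0.7))  # activated -> 1
print(step(0.2))  # not activated -> 0
```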

**Advantage**

- If we are creating a binary classifier, it makes sense as we eventually require a value of 0 or 1 so having such a function at the final layer will be cool.
- Extremely simple.

**Disadvantage**

- Impractical, as in most use cases your data has multiple classes to deal with.
- The gradient of the step function is zero everywhere (and undefined at the threshold). During back-propagation, the gradients of the activation functions are used to compute the error updates that improve and optimize the model; a gradient of zero means the model cannot improve.

Thus, because of how back-propagation works, we want the activation function to be differentiable.

Now let’s take a look at the larger picture. There are multiple neurons in our neural network. We discussed in the blog on the intuitive understanding of neural networks how they look for patterns in images. If you haven’t read it, all you need to know is that different neurons may select or identify different patterns. Revisiting the Dog vs Cat image identification example: if multiple neurons are being activated, what will happen?

### Linear Function

With the use case defined above, let’s try a linear function, as we have figured out that a binary function didn’t help much: f(X) = CX, a straight-line function where the activation is proportional to the weighted sum from the neuron. If more than one neuron gets activated, we can take the max over the neuron activation values, so that we have only one neuron to be concerned about.

Oh wait! The derivative of this function is a constant: f’(X) = C. What does that mean?

Well, this means that every time we do a back-propagation, the gradient is the same, so there is no improvement in the error. Also, since each layer applies a linear transformation, the final output is itself just a linear transformation of the input, no matter how many layers we stack. Further, an output space of (-inf, inf) is difficult to work with. Hence, not desirable.
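The point about stacked linear layers collapsing into one can be sketched in a one-dimensional toy case (the slopes below are arbitrary):

```python
# Two stacked linear "layers" (1-D case, toy coefficients) collapse
# into a single linear function of the input.
c1, c2 = 3.0, 0.5          # slopes of the two layers (arbitrary values)

def layer1(x):
    return c1 * x

def layer2(x):
    return c2 * x

def stacked(x):
    return layer2(layer1(x))

# stacked(x) == (c1 * c2) * x for every input, so depth adds nothing.
print(stacked(4.0), (c1 * c2) * 4.0)  # both print 6.0
```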

### Sigmoid Function

Let’s pull out the big guns: a smoother version of the step function, the sigmoid σ(z) = 1 / (1 + e^(-z)). It is non-linear, can be used for non-binary activations, and is continuously differentiable.

Most of the function’s change happens for Z roughly between -2 and 2; in that region, even small changes in the value of Z result in large changes in the value of the function. This pushes values towards the extreme parts of the curve, making clear distinctions in prediction. Another advantage of sigmoid activation is that the output lies in the range between 0 and 1, making it an ideal function for use cases where a probability is required.

That sounds all good! Then what’s the issue? Beyond +3 and -3 the curve gets pretty flat, which means the gradient at such points is very small. The improvement in the error becomes almost zero there, and the network learns slowly. This is known as the **vanishing gradient** problem. There are some ways to take care of this issue. Other issues are the computational load and the output not being zero-centered.
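A small sketch of the sigmoid and its derivative makes the vanishing gradient visible: the gradient peaks at 0 and is already tiny by z = 5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# Gradient is largest near 0 and nearly vanishes past |z| ~ 3.
for z in [0.0, 2.0, 5.0]:
    print(z, round(sigmoid_grad(z), 4))
```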

### Tanh Function

The hyperbolic tangent (tanh) activation function is very similar to the sigmoid function.

Compare the formula of the tanh function with the sigmoid: tanh(x) = 2 · sigmoid(2x) − 1

To put it in words, the tanh function is a scaled and shifted sigmoid, so it has similar properties. The tanh function also suffers from the vanishing gradient problem and therefore kills gradients when saturated. Unlike sigmoid, however, tanh outputs are zero-centered, since its range is between -1 and 1.
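The identity relating the two functions can be checked numerically at a few sample points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Check the identity tanh(x) = 2 * sigmoid(2x) - 1 at a few points.
for x in [-1.5, 0.0, 0.8]:
    lhs = math.tanh(x)
    rhs = 2.0 * sigmoid(2.0 * x) - 1.0
    assert abs(lhs - rhs) < 1e-12
    print(x, round(lhs, 4))
```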

### ReLU Function

To truly address the problem of vanishing gradients, we need to talk about the rectified linear unit (ReLU) and Leaky ReLU. ReLU is one of the most popular hidden-layer activation functions in deep neural networks.

g(z) = max{0, z}

When the input z < 0 the output is 0, and when z > 0 the output is z. A derivative exists for all values except 0, where the left derivative is 0 and the right derivative is 1. That looks like a new issue: how will this work with gradient descent? In practice it is rare to hit exactly 0, and software implementations of neural network training usually return one of the one-sided derivatives instead of raising an error. ReLU is computationally very efficient, but it is not a zero-centered function. Another issue is that if z < 0 during the forward pass, the neuron remains inactive and kills the gradient during the backward pass. The weights then do not get updated, and the network does not learn.
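A minimal sketch of ReLU and its gradient, using the one-sided-derivative convention the text mentions:

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # At exactly z == 0 we return the left derivative (0),
    # one common one-sided convention.
    return 1.0 if z > 0 else 0.0

print(relu(3.2), relu(-1.0))   # 3.2 0.0
print(relu_grad(2.0))          # 1.0: gradient flows
print(relu_grad(-2.0))         # 0.0: the "dead" region kills the gradient
```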

### Leaky ReLU

This is a modification of the ReLU activation function: when x < 0, instead of outputting 0, it has a small positive slope of 0.1. This feature eliminates the dying ReLU problem, but the results achieved with it are not always consistent. It retains the desirable characteristics of ReLU: it is computationally efficient, converges much faster, and does not saturate in the positive region.

f(x) = max(0.1*x,x)
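A one-line sketch using the 0.1 slope from the formula above (parameterized ReLU would instead learn this slope during training):

```python
def leaky_relu(x, slope=0.1):
    # For x < 0, slope * x > x, so the max keeps a small non-zero output
    # instead of flattening to 0 like plain ReLU.
    return max(slope * x, x)

print(leaky_relu(5.0))   # 5.0: positive region unchanged
print(leaky_relu(-5.0))  # -0.5: small negative slope instead of 0
```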

There are many different generalizations and variations to ReLU such as parameterized ReLU.

### Softmax Function

To tie things together, let’s discuss one last function, often used in the final layer of neural networks. The softmax function can be seen as a generalization of the sigmoid and is often used for multi-class classification problems.

The function is softmax(Z)_j = exp(Z_j) / Σ_k exp(Z_k). Look at the numerator: since we take an exponential of Z_j, the result is always positive. Further, even small changes in Z_j result in large changes in the output (exponential scale). The denominator is the sum of exp(Z_k) over all classes, which means the outputs add up to 1 and can be read as probabilities.

This makes it perfect for choosing among multiple classes. For example, classifying a satellite image of a landscape into one of several land-cover classes such as water, rain forest, or land.
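A small sketch of softmax on three example scores (the max is subtracted first, a standard numerical-stability trick; the input values are arbitrary):

```python
import math

def softmax(zs):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # largest score gets the largest probability
print(sum(probs))                    # probabilities sum to 1
```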

## Summary

Many activation functions and their characteristics have been discussed here. But the final question is when to use what?

There is no exact rule for the choice; rather, it is determined by the nature of your problem. Keep the characteristics of the activation functions in mind and choose the one that suits your use case and provides faster convergence. The most common practice is to use ReLU for the hidden layers and sigmoid (binary classification, e.g. Cat vs Dog) or softmax (multi-class classification) in the final layer.