What are Neural Networks made of? Understanding the different components and the architecture of Neural Networks.
Deep Learning Series
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired by and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series are explained here.
Architecture
The introduction to neural networks and a general idea behind the inspiration for such an algorithm has been discussed in the previous post. We will talk about the building blocks of neural networks in detail in future posts, but in this post we focus on the overall structure of Neural networks and discuss some of the components.
Terms
- Size: The number of nodes in the model.
- Width: The number of nodes in a specific layer.
- Depth: The number of layers in a neural network.
- Capacity: The type or structure of functions that can be learned by a network configuration.
- Architecture: The specific arrangement of the layers and nodes in the network.
- Input Layer: Input variables, sometimes called the visible layer.
- Hidden Layers: Layers of nodes between the input and output layers.
- Output Layer: A layer of nodes that produce the output variables.
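To make the first three terms concrete, here is a minimal sketch in plain Python. The layer configuration and the `describe` helper are made up for illustration. Counting conventions vary between sources, so this sketch assumes by default that the input layer is not counted, and lets you switch that with `count_input`.

```python
def describe(layer_widths, count_input=False):
    """Summarize a network given the number of nodes in each layer.

    layer_widths lists the input layer first and the output layer last.
    Assumption: by default the input layer is not counted, because it
    only holds the input variables and performs no computation.
    """
    layers = layer_widths if count_input else layer_widths[1:]
    return {
        "size": sum(layers),      # total number of nodes
        "depth": len(layers),     # number of layers
        "widths": list(layers),   # number of nodes in each layer
    }

# Hypothetical network: 4 inputs, two hidden layers of 8 nodes, 1 output.
print(describe([4, 8, 8, 1]))
# {'size': 17, 'depth': 3, 'widths': [8, 8, 1]}
```

With `count_input=True` the same network has size 21 and depth 4, which is why it helps to state the convention whenever these numbers are reported.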
Now, let’s briefly discuss the elements of a neural network.
The Elements of a Neuron
Neural networks are a set of algorithms, inspired by the working of the human brain. These algorithms are designed to recognize patterns. Neural networks consist of layers which are made of nodes. These nodes are where all the calculations happen.
- Neurons: The term artificial neuron is used in the literature interchangeably with: node, unit, perceptron, processing element or even computational unit. A neuron is the component of a neural network where a mathematical transformation takes place. It uses either input data or the resulting data from other neurons. It consists of the following components:
Weights:
Each input has its own relative weight. Weights are adaptive coefficients that determine the intensity of the input signal as registered by the artificial neuron. Using techniques like back-propagation discussed here, the weights are updated with each iteration in order to reduce the error. For now, all we need to know is that the weights are updated by special algorithms and that these algorithms require differentiation. So the weights change over time, but when we start training a neural network we face a question: how do we initialize the weights?
Zero Initialization
- If we initialize all weights to 0, every neuron in a layer computes the same output, so the model is no more expressive than a linear one. When we differentiate the error with respect to the weights, the gradients that flow back through the zero weights are also 0.
- The bigger issue is that when all weights have the same value, they all receive the same gradient. This means the weights are still identical to each other in the next iteration. If all of the weights are the same, they all have the same error and the model does not learn anything useful, because there is no source of asymmetry between the neurons.
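The symmetry problem can be checked by hand on a tiny network. The sketch below is an illustration under stated assumptions, not a full training loop: it assumes two inputs, three hidden tanh neurons, one linear output, no biases and a squared-error loss, and the function name is made up for this post.

```python
import math

def hidden_weight_gradients(W, v, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. each hidden neuron's weights.

    W[j] holds the input weights of hidden neuron j, v[j] its output weight.
    """
    z = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]  # weighted sums
    h = [math.tanh(z_j) for z_j in z]                               # hidden activations
    y_hat = sum(v_j * h_j for v_j, h_j in zip(v, h))                # linear output
    err = y_hat - y
    # Chain rule: dL/dW[j][i] = err * v[j] * (1 - h[j]**2) * x[i]
    return [[err * v_j * (1 - h_j ** 2) * x_i for x_i in x]
            for v_j, h_j in zip(v, h)]

x, y = [0.5, -1.0], 1.0   # one made-up training example

zero_grads = hidden_weight_gradients([[0.0, 0.0]] * 3, [0.0] * 3, x, y)
const_grads = hidden_weight_gradients([[0.3, 0.3]] * 3, [0.3] * 3, x, y)

print(zero_grads)    # every entry is zero, so these weights never move
print(const_grads)   # three identical rows, so the neurons stay copies of each other
```

In both cases the three hidden neurons receive exactly the same update, which is the missing asymmetry described above.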
Random Initialization
- Random initialization sounds like a good option, but it comes with its own problem. In this method, the weights are initialized randomly, but very close to zero. It is a better option than zero initialization because the initial weights are close to the ‘best guess’ expected value and the symmetry is broken enough for the algorithm to work. So what is the problem? Smaller numbers do not necessarily work better, especially in a deep neural network.
- The gradients calculated in back-propagation are proportional to the weights. As we back-propagate through the layers, the gradient tends to become smaller. If we also initialize with very small weights, the gradients become very small, and the network can take much longer to train or may never converge.
- To give you an idea, with back-propagation we start at the output layer and go back layer by layer. At each layer we calculate gradients (for simplicity, let’s say we differentiate some function). If the gradients are small, that is, the rate of change is small, the weights are updated very slowly. That is the problem.
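A rough way to see this effect is to shrink the network down to a chain of one-neuron layers. This is a deliberately simplified sketch: it assumes tanh activations and the same weight in every layer, and the function name is hypothetical.

```python
import math

def gradient_at_first_layer(weight, depth, x=1.0):
    """Gradient of the final activation w.r.t. the input of a chain of
    `depth` one-neuron tanh layers that all share the same weight."""
    grad = 1.0
    a = x
    for _ in range(depth):
        z = weight * a
        a = math.tanh(z)
        # Chain rule: each layer multiplies the gradient by w * tanh'(z)
        grad *= weight * (1 - a ** 2)
    return grad

for w in (0.01, 0.5, 1.0):
    print(w, gradient_at_first_layer(w, depth=10))
```

Each layer contributes a factor that contains the weight itself, so with a weight of 0.01 the gradient after ten layers is many orders of magnitude smaller than with a weight of 0.5, and the earliest layers barely get updated.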
He-et-al Initialization
In this method, the weights are initialized while keeping in mind the size of the previous layer, that is, the number of neurons in the previous layer. This helps gradient descent reach a low value of the cost function faster. The weights are still random, but their range differs from layer to layer, so the initialization is more controlled. More details about this technique are available here.
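As a minimal sketch, the commonly cited form of this scheme draws each weight from a zero-mean Gaussian whose standard deviation is sqrt(2 / n_in), where n_in is the number of neurons in the previous layer. The function name `he_init` and the list-of-lists layout are assumptions made for this example; deep learning libraries ship their own versions.

```python
import math
import random

def he_init(n_in, n_out, rng=random):
    """Weights for a layer with n_in inputs and n_out neurons.

    Each weight is drawn from a Gaussian with mean 0 and standard
    deviation sqrt(2 / n_in), so wider previous layers get smaller weights.
    """
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)                        # fixed seed so the run is repeatable
narrow = he_init(n_in=10, n_out=4, rng=rng)   # weights spread more widely
wide = he_init(n_in=1000, n_out=4, rng=rng)   # weights packed closer to zero
print(math.sqrt(2.0 / 10), math.sqrt(2.0 / 1000))
```

The weights are still random, but the spread is tied to the layer size instead of being an arbitrary small number.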
There are several techniques that can be used for initialization, but the ones mentioned here will give you some idea of how the weights, as a component, fit into neural networks.
Functions & Components
Summation Function
- Summation Function: The inputs and corresponding weights are vectors which can be represented as (x1, x2 … xn) and (w1, w2 … wn). The total input signal is the dot product of these two vectors. The result is a single number (x1 * w1) + (x2 * w2) +…+(xn * wn). The input and weighting coefficients can be combined in many different ways before passing on to the transfer function. Now, keep in mind that the summation function is only nomenclature. The function can be any kind of aggregation such as the minimum, maximum, majority, product or any other normalizing algorithm. The network architecture and paradigm determine the specific algorithm for combining neural inputs.
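The summation step is small enough to write out directly. In this sketch the name `aggregate` and the example numbers are made up; the `combine` argument shows how the sum can be swapped for another aggregation such as the maximum or minimum.

```python
def aggregate(inputs, weights, combine=sum):
    """Multiply each input by its weight, then combine the products."""
    products = [x * w for x, w in zip(inputs, weights)]
    return combine(products)

x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]

print(aggregate(x, w))        # dot product: 0.5 - 2.0 + 0.75 = -0.75
print(aggregate(x, w, max))   # largest weighted input: 0.75
print(aggregate(x, w, min))   # smallest weighted input: -2.0
```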
Activation Function
- Transfer Function: This is also known as an activation function. In its simplest form, the activation function compares the summation with some threshold to determine the neural output. If the sum is greater than the threshold value, the processing element generates a signal; if it is less than the threshold, no signal is generated.
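The threshold behaviour described here can be sketched as a step function. The sigmoid next to it is an addition for comparison only: it is one common smooth alternative, which matters because the weight updates rely on differentiation. Both function names are chosen for this example.

```python
import math

def step_activation(total, threshold=0.0):
    """Generate a signal (1) only when the summed input exceeds the threshold."""
    return 1 if total > threshold else 0

def sigmoid_activation(total):
    """Smooth alternative: squashes any summed input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-total))

print(step_activation(0.7, threshold=0.5))   # 1, the sum is above the threshold
print(step_activation(0.3, threshold=0.5))   # 0, the sum is below the threshold
print(sigmoid_activation(0.0))               # 0.5
```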
Scaling and Limiting
- Scaling and Limiting: Depending on the structure, the result can be passed through an additional step. Scaling computes (scale factor * transfer value + offset). Limiting is the mechanism that ensures the scaled result does not exceed an upper or lower bound.
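Both steps fit in a couple of lines. The function name, the default bounds of -1 and 1, and the example values below are assumptions chosen for illustration.

```python
def scale_and_limit(transfer_value, scale=1.0, offset=0.0, lower=-1.0, upper=1.0):
    """Scale the transfer value, then clamp it between lower and upper."""
    scaled = scale * transfer_value + offset
    return max(lower, min(upper, scaled))

print(scale_and_limit(0.25, scale=2.0, offset=0.25))   # 0.75, inside the bounds
print(scale_and_limit(0.75, scale=2.0, offset=0.25))   # 1.75 is limited to 1.0
print(scale_and_limit(-0.9, scale=2.0, offset=0.25))   # -1.55 is limited to -1.0
```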
Output Signal
- Output Signal: The final step is the output signal. A neuron has a single output that can be passed to the neurons in the next layer. In some architectures, neurons are allowed to compete with each other. This competition determines which artificial neuron is active, or provides an output, and takes part in the learning process.
Error Function
- Error Function and Back-Propagated Value: The difference between the current output and the desired output is calculated as an error. A mathematical function, the loss/error function, is applied to it; this can be the square of the error or some other formula. Each error function has its use case. Softmax Cross Entropy is one such example. The Softmax function is usually used for multi-class classification, that is, when we have many classes in our use case. We touch on the working of this function in more detail in a later post. The error is propagated backwards to a previous layer. This back-propagated value can be either the error, or the error scaled in some manner (often by the derivative of the transfer function).
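Here is a minimal sketch of the two error functions mentioned above, plus the value that would be propagated backwards for Softmax Cross Entropy. The function names and the example scores are made up; the gradient used is the standard result that it equals the predicted probabilities minus the one-hot target.

```python
import math

def squared_error(output, target):
    """Square of the difference between current and desired output."""
    return (output - target) ** 2

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Penalty is large when the true class gets a low probability."""
    return -math.log(probabilities[target_index])

def softmax_cross_entropy_gradient(scores, target_index):
    """Back-propagated value w.r.t. the scores: probabilities minus one-hot target."""
    probs = softmax(scores)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

scores = [2.0, 1.0, 0.1]   # hypothetical raw outputs for three classes
probs = softmax(scores)
print(squared_error(0.5, 1.0))                     # 0.25
print(cross_entropy(probs, target_index=0))        # small loss, class 0 is already favoured
print(softmax_cross_entropy_gradient(scores, 0))   # negative for the true class, positive elsewhere
```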
Learning Rate
- Learning Rate: Learning rate determines how fast weights are updated. In general, you want to find a learning rate that is low enough that the network converges to something useful, but high enough that you don’t have to spend a lot of time training it. Learning rate might be different for different layers of a neural network.
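The trade-off is easy to see on a toy problem. This sketch assumes a single weight and the made-up error function (w - 3)^2, whose minimum is at w = 3; a real network has many weights, but the update rule has the same shape.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize the toy error (w - 3)**2 starting from w."""
    for _ in range(steps):
        grad = 2 * (w - 3.0)        # derivative of (w - 3)**2
        w -= learning_rate * grad   # the learning rate scales every update
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

After 20 steps the rate 0.01 is still far from 3, the rate 0.1 is close to it, and the rate 1.1 overshoots further on every step and moves away from the minimum.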
Layers
A neural network is a grouping of neurons into layers, and there can be many layers between the input and output layers. Most applications require networks that contain at least three layers: input, hidden, and output. In a fully connected network, each neuron in a hidden layer is connected to all the neurons in the previous layer. We can start with two basic types of perceptron. They feed information from the front to the back and are therefore called Feed Forward networks.
Single Layer Feed Forward Neural Network consists of a single layer of computing neurons, that is, it has only the input and output layers. A single-layer perceptron can only be used for very simple problems, such as classifying classes that can be separated by a line.
Multi Layer Feed Forward Neural Network consists of one or more hidden layers, whose computation nodes are called hidden neurons or hidden units. A Multilayer Perceptron can represent convex regions, so it can separate classes that a single line cannot, including in high-dimensional spaces.
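The difference between the two can be sketched with hand-picked weights rather than trained ones. The example below assumes a step activation and expresses each threshold as a bias term; the weights, biases and function names are all chosen for this illustration. AND is separable by a line, so one layer is enough, while XOR is not, so it needs a hidden layer.

```python
def step(z):
    return 1 if z > 0 else 0

def layer(inputs, weights, biases):
    """One fully connected layer: every neuron sees every input."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def single_layer_and(x):
    # Input and output layer only: fires when both inputs are 1.
    return layer(x, [[1.0, 1.0]], [-1.5])[0]

def multi_layer_xor(x):
    # Hidden layer computes OR and AND; the output fires for "OR but not AND".
    hidden = layer(x, [[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5])
    return layer(hidden, [[1.0, -1.0]], [-0.5])[0]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, single_layer_and(x), multi_layer_xor(x))
# (0, 0) 0 0
# (0, 1) 0 1
# (1, 0) 0 1
# (1, 1) 1 0
```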
What architecture works for me?
Now, we know it makes sense to have multiple layers especially when dealing with images or complex data.
How do we decide on what architecture to use? How many hidden layers should be used?
- Larger Neural Networks can represent more complicated functions. This can improve accuracy, but it also increases the chance of over-fitting. However, there are other methods that help control over-fitting (such as L2 regularization and Dropout), so the number of neurons should not be used as the control parameter for it.
- More neurons and layers might add to the computational load.
There has to be a trade-off, and there is no definite answer to this question. However, I can suggest the following:
Experimentation: Find out what works best for your data given the computational constraints.
Intuition or Google: Based on experience with past models, you can come up with an answer. If you have a standard DL problem such as image classification, you can Google to find out what others have used (ResNet, VGG and so on).
Search: Try random or grid search for different architectures and choose the one giving the best score.
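As a rough sketch of the search option, the two helpers below enumerate candidate architectures and keep the best one. Everything here is hypothetical: the helper names, the restriction to hidden layers of equal width, and especially `toy_score`, which stands in for the real work of training each candidate and measuring it on validation data.

```python
import itertools
import random

def grid_search(depths, widths, evaluate):
    """Try every (depth, width) pair and return the best-scoring architecture."""
    candidates = [[w] * d for d, w in itertools.product(depths, widths)]
    return max(candidates, key=evaluate)

def random_search(depths, widths, evaluate, trials=10, seed=0):
    """Sample a fixed number of architectures at random and return the best."""
    rng = random.Random(seed)
    candidates = [[rng.choice(widths)] * rng.choice(depths) for _ in range(trials)]
    return max(candidates, key=evaluate)

def toy_score(hidden_layers):
    # Placeholder only: pretends that about 48 hidden neurons is ideal.
    # In practice this would train the network and return a validation score.
    return -abs(sum(hidden_layers) - 48)

print(grid_search([1, 2, 3], [8, 16, 32], toy_score))   # [16, 16, 16]
print(random_search([1, 2, 3], [8, 16, 32], toy_score))
```

Grid search is exhaustive but grows quickly with the number of options, while random search caps the cost at a fixed number of trials.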
There are several different architectures shown in the image below. To summarize, these are the parameters that govern or define the architecture:
- The Elements of a Neuron (type of activation function, scaling, limiting and so on)
- The Size, Width and Depth of the Neural Network.
- How different neurons are connected to each other. For example the Hopfield network (HN) is a network where every neuron is connected to every other neuron.
What’s Next?
Inside the Black box: What is going on inside this Black box algorithm? Trying to build intuition and understanding of what is going on in the different layers of a neural network. Let’s continue with the learning in this next article where we take a closer look at what happens with the different neurons and respective layers of a Neural Network.