A Complete Guide to Training Multi-Layered Perceptron Neural Networks

Saarathi Anbuazhagan
9 min read · Jun 30, 2021


This article takes you through a complete journey of training Multi-Layered Perceptron Neural Networks, from understanding what a perceptron is to the steps involved in training an MLP network.

Map to a great voyage:

  1. Introduction to the perceptron
  2. What is a Multi-Layered Perceptron Neural Network?
  3. Data preprocessing
  4. Training MLP Networks
  5. Choosing an Activation function
  6. Ways of performing Weight Initialization
  7. Regularization of Model — applying Batch normalization and Dropout layer
  8. Choose your best Optimizer
  9. Monitoring gradients
  10. Hyperparameter tuning using Keras Tuner

1. Introduction to the perceptron

A perceptron is a single-layer neural network inspired by biological neurons. In a biological neuron, the dendrites are responsible for receiving incoming signals, the cell body processes those signals, and if the neuron fires, the nerve impulse is sent out through the axon.

Artificial neurons use a similar concept in a mathematical way: they take inputs, assign each one a weight, sum the weighted inputs, pass the result through an activation function, and produce an output.
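As a rough sketch of that computation (a NumPy illustration with made-up weights and a simple step activation, not part of the original example):

```python
import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs plus a bias term
    z = np.dot(w, x) + b
    # step activation: "fire" (output 1) only if the weighted sum is positive
    return 1 if z > 0 else 0

# illustrative inputs and weights
x = np.array([1.0, 0.5])
w = np.array([0.4, -0.2])
print(perceptron(x, w, b=0.1))  # -> 1
```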

2. What is a Multi-Layered Perceptron NN?

A single perceptron can only model a linear relationship between inputs and output. To solve more complex problems, we need a model that can capture non-linear relationships, and that is where the Multi-Layered Perceptron NN comes in.

A Multi-Layered Perceptron NN can have any number of hidden layers between the input and output layers. Each hidden layer can have any number of neurons: the first hidden layer takes its input from the input layer, processes it with an activation function, and passes the result to the next hidden layer, and so on until the output layer. Every neuron in a hidden layer uses a non-linear activation function. An MLP is trained with a supervised learning technique called backpropagation.

Multi-Layered Perceptron Neural Network
Simple example of MLP NN

Here we have solved a simple mathematical problem using an MLP neural network, one that cannot be solved with a single perceptron. For this example, I have used simple mathematical functions in place of activation functions.
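In code, a minimal MLP can be defined with Keras; the layer sizes, activations and input shape below are illustrative choices, not the values from the figure above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# a small MLP: two non-linear hidden layers between input and output
model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(4,)),  # first hidden layer
    layers.Dense(8, activation="relu"),                     # second hidden layer
    layers.Dense(1, activation="sigmoid"),                  # output layer (binary classification)
])
model.summary()
```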

3. Data Preprocessing

We cannot feed arbitrary data directly into a neural network. The data we feed into an NN should be numerical (real values). If you have categorical data, it should be converted into numerical values using techniques like one-hot encoding.

If a column has real values spanning a huge range (e.g. 1 to 10000), it should be scaled down using techniques like normalization or standardization.
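For example, with pandas and scikit-learn (the column names and values here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical data: one categorical column and one wide-ranging numeric column
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "price": [5, 10000, 250]})

# one-hot encode the categorical column
df = pd.get_dummies(df, columns=["color"])

# standardize the numeric column to zero mean and unit variance
df["price"] = StandardScaler().fit_transform(df[["price"]]).ravel()
print(df)
```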

4. Training MLP Networks

A deep MLP neural network tries to learn the underlying pattern, i.e. to map inputs to outputs through its weights, by training on the given data. To achieve this we use an optimization technique called Stochastic Gradient Descent (SGD). We compute the loss with a loss function, calculate its derivatives, and update the weights during backpropagation. The main goal is to minimize the difference (loss) between the predicted output and the actual output.

Loss function, Chain rule and Memoization:

In SGD, we pass one input at a time. It travels through the network of neurons (each with an activation function) and produces an output. We calculate the loss by comparing the predicted value (output) with the actual value; the loss function is chosen based on the problem we are solving (regression, binary or multi-class classification). Then we compute the derivatives using the chain rule. Once a derivative is calculated, it is stored in memory so it can be reused. This technique is called memoization, and it speeds up the computation by avoiding recalculating the same derivative again.

Backpropagation:

Backpropagation can be described as the combination of the chain rule and memoization. It is during backpropagation that the derivatives are calculated.

Backpropagation using SGD:

  1. Initialize the weights
  2. For each Xi:

i) pass Xi forward through the network

ii) calculate the loss

iii) compute the derivatives and update the weights

3. Repeat step 2 until convergence.
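A minimal NumPy sketch of these steps, assuming a single sigmoid neuron and a squared-error loss (the toy data is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 samples, 2 features, binary targets
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.3, size=2)            # step 1: initialize weights
b = 0.0
lr = 0.5

for epoch in range(1000):                    # step 3: repeat until convergence
    for xi, yi in zip(X, y):                 # step 2: one sample at a time (SGD)
        y_hat = sigmoid(np.dot(w, xi) + b)   # i)   forward pass
        loss = 0.5 * (y_hat - yi) ** 2       # ii)  squared-error loss
        # iii) chain rule: dL/dw = (y_hat - y) * y_hat * (1 - y_hat) * x
        grad = (y_hat - yi) * y_hat * (1 - y_hat)
        w -= lr * grad * xi
        b -= lr * grad
```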

A small overview of underlying mathematics:

MLP with inputs, weights, activation function, output, loss function
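Since the figure itself is not reproduced here, the forward pass and the chain-rule gradient for a layer l can be sketched in standard notation (not necessarily the exact symbols used in the image):

```latex
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad
a^{(l)} = f\big(z^{(l)}\big), \qquad
L = \ell\big(\hat{y},\, y\big)

\frac{\partial L}{\partial W^{(l)}} =
\frac{\partial L}{\partial a^{(l)}} \cdot
\frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot
\frac{\partial z^{(l)}}{\partial W^{(l)}}
\qquad \text{(chain rule)}
```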

5. Choosing an Activation function

The input is passed into neurons, where it first undergoes a linear transformation and is then passed through an activation function, which introduces non-linearity into the network. The output is then passed to the next layer.

Sigmoid Activation Function:

The sigmoid function is the logistic function used in the logistic regression algorithm. Its output ranges from 0 to 1. We need the derivative of the sigmoid because when we backpropagate through the network we need the derivatives of these activation functions.

Tanh Activation Function:

The hyperbolic tangent (tanh) activation function. Both sigmoid and tanh are S-shaped curves, but the output of tanh ranges from -1 to 1.

Vanishing gradient and Exploding gradient problem:

When we backpropagate through the network we multiply many derivatives together. The derivative of the sigmoid lies in the range 0 to ~0.25, so multiplying many such derivatives gives a very small value. These tiny values barely update the weights, and the weights appear to converge long before they reach their true optimum. This is called the vanishing gradient problem: the gradients effectively vanish before convergence.

This can be mitigated by using tanh, whose derivative lies in the range 0 to 1. But in some cases this can lead to the exploding gradient problem: if the derivatives are too large, the weight updates become very large, which can make the network unstable.

Rectified Linear Unit (ReLU) Activation Function:

To overcome the vanishing gradient problem, here comes ReLU. ReLU is a simpler activation function that returns the input unchanged if it is positive and returns 0 if it is negative. Its derivative is just as simple: the slope is 1 for positive values and 0 for negative values. This helps avoid the vanishing gradient problem, as there are none of the small derivative values we get with the sigmoid function.

But the 0 slope for negative inputs can lead to a problem called dead activations. A simple fix for this is Leaky ReLU, a small modification of ReLU that returns the input unchanged if x is positive and returns 0.01·x if it is negative.

Advantages: ReLU is computationally efficient, as it uses only simple mathematical operations, and convergence is faster because there is no vanishing gradient problem.
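The activation functions and derivatives discussed above can be written out directly; a short NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1 - np.tanh(z) ** 2      # peaks at 1 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)    # slope 1 for positive inputs, 0 for negative

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope instead of 0 for negatives
```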

6. Ways of performing Weight initialization

Weights indicate the importance of an input value: if one input has a higher weight than another, it plays a more important role in predicting the output. During training, the feed-forward network starts with some initial weight values, which are then updated during backpropagation.

Weights can be initialized in many different ways, for example by selecting small random values in a range such as [0, 0.3] or [0, 1]. A few conditions should be kept in mind: the values should be small, the weights should not all be identical (which leads to symmetrical calculations in all neurons), and they should not all be zero. Because weights play such a major role, more principled initialization schemes are used, some of them specific to the activation function chosen.

Xavier/Glorot Weight Initialization for Sigmoid:

Xavier initialization draws random values from a uniform distribution chosen so that the variance of the activations stays roughly the same from layer to layer.

weight ~ U[ -√6/√(in+out), +√6/√(in+out) ]

Here, U is the uniform distribution, 'in' is the number of inputs to the node, and 'out' is the number of outputs from the node. There is also a variant of Xavier initialization that uses a normal distribution.

He Initialization for ReLU:

He initialization uses a normal distribution with mean 0 and standard deviation given by the formula below:

weight ~ N(0, σ), where σ = √(2/in)

Here, 'in' is the number of inputs to the node.
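In Keras, both schemes are available as built-in initializers; a quick sketch:

```python
from tensorflow.keras import layers, initializers

# Xavier/Glorot uniform initialization, typically paired with sigmoid or tanh
dense_sigmoid = layers.Dense(64, activation="sigmoid",
                             kernel_initializer=initializers.GlorotUniform())

# He normal initialization, typically paired with ReLU
dense_relu = layers.Dense(64, activation="relu",
                          kernel_initializer=initializers.HeNormal())
```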

7. Regularization of Model

Dropout Layer:

During training, a deep neural network can over-fit the training data. One reason is that when many neurons in a layer extract the same information from the input, it creates interdependency among the neurons. This can be addressed with a Dropout layer, a regularization technique used to prevent over-fitting.

In a Dropout layer, during the training phase we randomly drop out (zero out) some percentage of the neurons in a layer. It can be placed after any hidden layer, and we can specify the dropout rate, which determines the fraction of neurons to drop. During testing, the entire network is used without dropout.

MLP NN with Dropout Layer

Batch normalization:

During training, before feeding inputs into the feed-forward network we normalize them, scaling the values down to a particular range. But even though we pass in normalized data, after it flows through several hidden layers the activations are no longer on the same scale: as each batch of inputs passes through the neurons and their activation functions, its distribution changes. This is called internal covariate shift, and it can be addressed by introducing a Batch Normalization layer. The equation for batch norm:
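Written out (the standard batch-norm formulation, shown here because the equation image is not reproduced):

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^{2} = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^{2}, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
```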

We standardize each mini-batch of inputs and then transform them using two parameters, scale (γ) and shift (β), which are learned during optimization. Epsilon (ε) is added for numerical stability.

Advantages: Both dropout and batch norm help with faster convergence, and batch norm in particular allows the use of a larger learning rate.
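Both layers are one-liners in Keras; the sizes and rates below are illustrative, not prescriptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.BatchNormalization(),   # re-normalize the activations of the previous layer
    layers.Dropout(0.3),           # randomly zero out 30% of units during training
    layers.Dense(32, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
```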

8. Choose your best Optimizer

Optimizers are used to update the parameters and reduce the loss efficiently. There are multiple optimization techniques for finding the minimum of the loss, such as Gradient descent, Stochastic gradient descent, Stochastic gradient descent with momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSProp, and Adam. Here we will discuss the most commonly used ones.

Nesterov accelerated gradient:

Stochastic gradient descent can take a long time to converge. To accelerate convergence, a technique called momentum was added: it keeps an exponentially weighted average of past gradients, which smooths the updates and reduces oscillations. Nesterov accelerated gradient uses the same momentum idea in a different way: we first apply the momentum step and then compute the gradient at that look-ahead position. This helps reach convergence faster.

Adagrad:

Up to this point we have used a constant learning rate. Sparse features produce smaller gradients than dense ones; most of their gradients are 0, so their weights barely move. To address this, the learning rate is adapted per feature (per parameter). Adagrad adjusts the learning rate automatically at each step: the effective learning rate is computed from the accumulated sum of the squares of all previous gradients. This accumulated sum (αt) grows at every iteration, which shrinks the effective learning rate (φ′); for φ we start from a small constant value. But when αt grows too quickly, φ′ can shrink so much that convergence becomes very slow.
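In more standard notation, the usual Adagrad update looks like this (η is the base learning rate and g_t the gradient at step t; a sketch, not the article's exact symbols):

```latex
\alpha_t = \sum_{\tau=1}^{t} g_\tau^{2}, \qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{\alpha_t + \epsilon}}\, g_t
```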

Adadelta:

Adadelta is similar to Adagrad but accumulates the gradient term as an exponentially decaying average rather than a full sum. This keeps the denominator from growing without bound, controlling the decay of the learning rate and solving Adagrad's slow-convergence issue.

Adam:

Adam combines the advantages of momentum and of RMSProp/Adadelta: it keeps moving averages of both the first moment (mean) and the second moment (uncentered variance) of the gradients. It is the most widely used optimizer, as it generally converges faster than the others.

Adam Optimizer
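Choosing an optimizer in Keras is a single argument at compile time; a sketch with illustrative learning rates:

```python
from tensorflow import keras
from tensorflow.keras import layers, optimizers

# a few of the optimizers discussed above
sgd_nesterov = optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
adagrad = optimizers.Adagrad(learning_rate=0.01)
adam = optimizers.Adam(learning_rate=0.001)

# any Keras model can then be compiled with the chosen optimizer
model = keras.Sequential([layers.Dense(8, activation="relu", input_shape=(4,)),
                          layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])
```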

9. Monitoring gradients

One big issue with gradients is the exploding gradient. Gradient clipping helps keep them under control by preventing them from exceeding a threshold value: when the norm of the gradient vector exceeds the threshold, the vector is rescaled so its norm equals the threshold, which stops the gradients from growing too large.

Gradient Clipping
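In Keras, clipping can be set directly on the optimizer; a short sketch:

```python
from tensorflow.keras import optimizers

# rescale the whole gradient vector whenever its L2 norm exceeds 1.0
opt_clipnorm = optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# alternatively, clip each gradient component to the range [-0.5, 0.5]
opt_clipvalue = optimizers.SGD(learning_rate=0.01, clipvalue=0.5)
```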

10. Hyperparameter tuning using Keras Tuner

Hyperparameter tuning is one of the important processes that help design the best model for your data. Keras Tuner helps find the best set of hyperparameters. In the example below, we tune the number of layers, the number of neurons, the learning rate, and the activation function for a classification model. Keras Tuner ships with four tuners: RandomSearch, Hyperband, BayesianOptimization, and Sklearn.
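A minimal sketch of such a search with Keras Tuner; the ranges, the 10-class softmax output and the x_train/y_train names are assumptions for illustration:

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential()
    # tune the number of hidden layers and the neurons per layer
    for i in range(hp.Int("num_layers", 1, 3)):
        model.add(layers.Dense(hp.Int(f"units_{i}", 32, 256, step=32),
                               activation=hp.Choice("activation", ["relu", "tanh"])))
    model.add(layers.Dense(10, activation="softmax"))
    # tune the learning rate on a log scale
    lr = hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
# assuming x_train, y_train, x_val, y_val are prepared beforehand:
# tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
# best_model = tuner.get_best_models(num_models=1)[0]
```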

Want to learn more?

Loss functions: explore the different types of loss functions used for regression and classification problems.

TensorBoard: a visualization tool provided with TensorFlow. It helps track metrics like loss and accuracy by visualizing them as graphs, and much more.

GitHub repository and LinkedIn:

GitHub, LinkedIn
