
Activation Functions in ANNs (Conclusion)

In my last article, Activation Functions in ANNs, we discussed a few activation functions; now let's explore some of the other available activation functions.

Tanh Function

The tanh function is a scaled version of the sigmoid function and has a similar S-shaped curve.

f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1

Or

\tanh(x) = 2 \cdot \text{sigmoid}(2x) - 1

It is nonlinear, so we can stack more than one layer of neurons depending upon the requirement. Its range is (-1, 1). The gradient of tanh is steeper than that of the sigmoid function; in other words, its derivative takes larger values around zero. Like the sigmoid function, tanh also suffers from the vanishing gradient problem.
So the choice between sigmoid and tanh is made depending upon the gradient strength required by the problem. Tanh is a very popular activation function, just like the sigmoid function.
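
As a quick illustration, here is a minimal NumPy sketch (the function names are my own) that implements tanh through the scaled-sigmoid identity above and checks it against np.tanh:

import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Tanh written as a scaled and shifted sigmoid; range is (-1, 1)
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x))     # matches np.tanh(x) up to floating-point error
print(np.tanh(x))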

Rectified Linear Unit (ReLU) Function

f(x) = \max(0, x)

The ReLU function is shown above. It outputs x if x is positive and 0 if x is negative. It is a continuous function but is not differentiable everywhere (the kink is at x = 0). It is also a very popular activation function. The range of ReLU is [0, ∞). The ReLU function is more efficient than the widely used logistic sigmoid and the hyperbolic tangent because it greatly reduces the computation cost. Its main advantage is that it greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid or tanh functions, thanks to its linear, non-saturating form. In contrast, the tanh and sigmoid activations involve exponentials, which are computationally expensive operations. ReLU can be implemented by simply thresholding a matrix of activations at zero (a short sketch appears below).

The disadvantage of the ReLU activation function is that it can be fragile during training. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any data point again. For negative x the gradient is zero, so the corresponding weights do not get adjusted during gradient descent. Neurons that fall into this state stop responding to variations in the input. This is known as the dying ReLU problem, and it can cause several neurons to die, making a substantial part of the network passive. Variations of ReLU address this issue by simply replacing the horizontal line in the negative region with a non-horizontal component.

The ReLU function is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. That is an important point to consider when designing deep neural networks.
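
To make the "thresholding a matrix of activations at zero" point concrete, here is a minimal NumPy sketch (the function names are my own) of ReLU and its gradient:

import numpy as np

def relu(x):
    # ReLU: threshold the activations at zero, element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x <= 0 (the "dying ReLU" region)
    return (x > 0).astype(x.dtype)

activations = np.array([[-1.5, 0.3], [2.0, -0.7]])
print(relu(activations))       # [[0.  0.3] [2.  0. ]]
print(relu_grad(activations))  # [[0. 1.] [1. 0.]]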

Leaky ReLU Function

Leaky ReLU is a refinement of the ReLU function that addresses the dying ReLU problem. Instead of the function being zero when x is less than 0, a leaky ReLU has a small non-zero slope in the negative region. The function is computed as follows:

f(x) = \begin{cases} ax & \text{if } x < 0 \\ x & \text{otherwise} \end{cases}

For example, y = 0.01x for x < 0 makes the negative part a slightly inclined line rather than a horizontal one. The main idea of Leaky ReLU is to keep the gradient non-zero so the neuron can eventually recover during training.
The range of Leaky ReLU is (-∞, ∞).
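
Here is a minimal NumPy sketch of Leaky ReLU, assuming the commonly used default slope a = 0.01 (in Leaky ReLU the coefficient is a fixed hyperparameter, not learned):

import numpy as np

def leaky_relu(x, a=0.01):
    # Small slope a in the negative region instead of a flat zero,
    # so the gradient never becomes exactly zero for x < 0
    return np.where(x < 0, a * x, x)

x = np.array([-3.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))   # [-0.03  -0.005  0.  1.5]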

Parametric ReLu Function

The slope in the negative region can also be made a parameter of each neuron; in that case we get the Parametric ReLU (PReLU), which takes the Leaky ReLU idea further by turning the coefficient of leakage into a parameter that is learned along with the other neural network parameters.

f(x) = \alpha x \quad \text{when } x < 0

where α is a small coefficient (smaller than 1) that is learned during training. The Parametric ReLU formula is equivalent to:

f(x) = \max(x, \alpha x)

The range of Parametric ReLU is (-∞, ∞).
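
As a rough illustration of the learned coefficient of leakage, here is a toy NumPy sketch (the names, the upstream gradient, and the learning rate are my own; a real implementation would update α inside the usual backpropagation loop):

import numpy as np

def prelu(x, alpha):
    # Equivalent to max(x, alpha * x) as long as alpha < 1
    return np.maximum(x, alpha * x)

def prelu_grad_alpha(x):
    # Derivative of the output with respect to alpha: x for x < 0, else 0
    return np.where(x < 0, x, 0.0)

alpha = 0.25                      # initial value; learned alongside the weights
x = np.array([-2.0, 1.0])
upstream = np.array([1.0, 1.0])   # gradient arriving from the next layer
alpha -= 0.1 * np.sum(upstream * prelu_grad_alpha(x))  # one toy gradient step
print(prelu(x, alpha), alpha)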

Swish Activation Function

Recently, researchers from Google published a paper on an activation function known as the Swish activation function. Swish is defined by the function

f(x) = x \cdot \text{sigmoid}(\beta x)

where β is a learnable parameter. When β = 0 the sigmoid part is always 1/2, so f(x) becomes linear (f(x) = x/2). On the other hand, if β is very large, the sigmoid part behaves like a binary activation, i.e. 0 for x < 0 and 1 for x > 0, and f(x) approaches the ReLU function. The Google research paper also mentions that Swish tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. Swish also addresses the dying ReLU problem, so it can be used in place of ReLU.

However, Swish takes more computation time than ReLU because it involves the sigmoid, and therefore an exponential, in its calculation.
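
Here is a minimal NumPy sketch of Swish, assuming β is passed in as a plain argument (in practice it may be fixed to 1 or learned during training):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

x = np.array([-2.0, 0.0, 2.0])
print(swish(x, beta=0.0))    # beta = 0 -> sigmoid part is 0.5, so output is x / 2
print(swish(x, beta=10.0))   # large beta -> close to ReLU: roughly [0, 0, 2]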

Softmax

The Softmax function, also called the normalized exponential function, is used for the output layer of multi-class classification networks. In mathematics it is a generalization of the logistic function that "squashes" a K-dimensional vector z of arbitrary real values into a K-dimensional vector σ(z) of real values in the range [0, 1] that add up to 1. The function is given by

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K

The output of the Softmax function can be used to represent a probability distribution over K different possible outcomes. In other words, the output of Softmax is a categorical distribution; it is the gradient-log-normalizer of the categorical probability distribution.
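
A minimal NumPy sketch of softmax follows (the max-subtraction step is my addition for numerical stability and does not change the result):

import numpy as np

def softmax(z):
    # Exponentiate and normalize so the outputs lie in [0, 1] and sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0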

Selecting Appropriate Activation Function

  • Sigmoid activation functions are often used for classification problems; however, the computation cost of sigmoid is higher than that of ReLU.
  • ReLU can overfit more easily than sigmoid; dropout can be used to reduce overfitting by preventing complex co-adaptations on the training data.
  • Training is faster with ReLU than with sigmoid because it requires less numerical computation.
  • Sigmoid suffers from the vanishing gradient problem, whereas ReLU does not.
  • The sigmoid activation function should therefore be avoided in deep networks.
  • Since ReLU addresses the vanishing gradient problem, it can be used in deep learning networks.
  • Softmax can be used for any number of classes and is useful for tasks such as object recognition.
  • For larger datasets, we can choose between ReLU and Swish depending on the trade-off between accuracy and computation cost.

Conclusion

There is no single "best" activation function that can be selected by default for all ANNs. We can start with one activation function and move to others to find the best one for the given business problem and dataset size.
