In an artificial neural network (ANN), the activation function of a node defines the output that node produces for a given input or set of inputs. In the simplest case it acts as a threshold above which the node fires. Activation functions can be linear or nonlinear, but nonlinear functions are used in most ANNs. The choice matters for how a network learns, because not all data is equally important. Some of the input data is simply insignificant, and activation functions help the neural network make this separation: they let the useful information through and suppress the insignificant parts of the input.
Let us explore activation functions in more detail: how they work, and which activation functions fit which kind of business problem.
Overview of neural networks
The neural system of the human body consists of three components: the receptors, the neural network, and the effectors. The receptors receive stimuli, either internal or from the external world, and pass the information to the neurons in the form of electrical impulses. The neural network then processes these inputs to make a decision, which is converted into outputs. Finally, the effectors translate the electrical impulses from the neural network into responses to the outside environment. The incoming signal from each synapse to the neuron is either excitatory or inhibitory, meaning it either helps or hinders firing. The neuron fires when the excitatory signal exceeds the inhibitory signal by a certain amount within a short period of time, called the period of latent summation. In other words, when the total weight of the synapses that receive impulses during the period of latent summation exceeds the threshold, the neuron fires.
For example, if a mosquito bites us on the upper arm, the afferent (sensory) neurons carry this information to the brain. Based on this input, the brain works out where the mosquito is, and the efferent neurons immediately carry instructions to the hand muscles. In the blink of an eye, the mosquito is killed. So how did the brain decide it was a mosquito? From the input signal sent by the sensory (afferent) neurons, on the basis of which it then decided what to do.
When we replicate a similar model in ANNs, we use activation functions to decide whether a neuron should be activated or whether the information fed to it should be ignored.
Based on the McCulloch-Pitts model, the general form of an artificial neuron can be described as shown in the figure. In the first step, the linear combination of the inputs is calculated. Each value of the input array is associated with a weight value, which ranges between 0 and 1. The summation function also takes an extra input with a weight of 1 to represent the bias of the neuron. The summation is then performed.
The sum-of-products value is then passed to the next step, where the activation function generates the output of the neuron. The activation function limits the amplitude of the output to the range (0, 1) or (-1, 1). The behavior of the activation function determines the characteristics of an ANN model.
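The two steps described above (weighted sum, then activation) can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names `logistic` and `neuron_output` and the sample numbers are assumptions made for the example, and the bias is added directly to the sum, which is equivalent to an extra input with a weight of 1.

```python
import math

def logistic(z):
    """Logistic sigmoid: maps any real z into the open range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=logistic):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Step 1: linear combination (sum of products) plus the bias term.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: the activation function produces the neuron's output.
    return activation(z)

# Hypothetical inputs, weights and bias, chosen only for illustration.
y = neuron_output(inputs=[0.5, 0.2, 0.9], weights=[0.4, 0.7, 0.1], bias=0.1)
print(y)  # a value strictly between 0 and 1
```

Passing a different function as `activation` (for example `lambda z: z`) changes the character of the neuron without touching the summation step.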
Different types of activation functions
The identity function is denoted by f(x) = x for all x. Single-layer neural networks make use of a step function to convert a continuously varying input into a binary output (0 or 1) or a bipolar output (1 or -1).
Binary Step Function
This is a threshold-based classifier, i.e. it decides whether or not the neuron should be activated. If the input value is at or above a given threshold, the neuron is activated; otherwise it is not.
A binary step function with a threshold T is given by f(x) = 1 if x >= T and f(x) = 0 if x < T.
It is useful when creating a binary classifier. However, it fails when the classification problem has multiple classes, and it is harder to train and to get to converge. The gradient of the step function is zero everywhere except at the threshold, where it is undefined. So it becomes useless during back-propagation, when the gradients of the activation functions are used in the error calculations to optimize the results.
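To make the zero-gradient point concrete, here is a small sketch of the binary step function together with a numerical estimate of its slope. The helper `numerical_gradient` is an assumption introduced for this example (a plain central difference); it is not part of any library.

```python
def binary_step(x, threshold=0.0):
    """Return 1 if x is at or above the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Away from the threshold the output is flat, so the estimated slope is zero
# and back-propagation receives no signal to adjust the weights.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, binary_step(x), numerical_gradient(binary_step, x))
```

Only exactly at the threshold does the output jump, and there the derivative is not defined at all.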
The linear function is a straight-line function in which the activation is proportional to the input, i.e. the weighted sum from the neuron: f(x) = cx. When we differentiate this function with respect to x, the derivative is c, a constant. This implies that the gradient has no relationship with x. The gradient is constant, so the descent also proceeds on a constant gradient. If there is an error in prediction, the changes made by back-propagation are constant and independent of the input.
In the multi-layer case, it does not matter how many layers we have: if all of them are linear, the output of the last layer is still just a linear function of the input to the first layer. That means that two or more linear layers, or N layers, can be replaced by a single layer.
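The collapse of stacked linear layers can be checked directly. The sketch below uses plain Python lists as small matrices; the helper names `linear_layer` and `compose_linear` and all weight values are assumptions made purely for illustration. Two layers W1 x + b1 and W2 h + b2 combine into a single layer with W = W2 W1 and b = W2 b1 + b2.

```python
def linear_layer(weights, bias):
    """Build a layer that computes W x + b with a linear activation."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer

def compose_linear(w2, b2, w1, b1):
    """Collapse layer2(layer1(x)) into one layer: W = W2 W1, b = W2 b1 + b2."""
    inner = len(w1)        # number of outputs of layer 1
    cols = len(w1[0])      # number of inputs to layer 1
    w = [[sum(w2[i][k] * w1[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(len(w2))]
    b = [sum(w2[i][k] * b1[k] for k in range(inner)) + b2[i]
         for i in range(len(w2))]
    return w, b

# Hypothetical weights and biases, chosen only for illustration.
W1, B1 = [[1.0, 2.0], [0.5, -1.0]], [0.0, 1.0]
W2, B2 = [[2.0, 0.0], [1.0, 1.0]], [-1.0, 0.5]

def two_layers(x):
    """Apply layer 1, then layer 2."""
    return linear_layer(W2, B2)(linear_layer(W1, B1)(x))

W, B = compose_linear(W2, B2, W1, B1)
one_layer = linear_layer(W, B)

x = [3.0, -2.0]
print(two_layers(x), one_layer(x))  # both lists hold the same values
```

The same argument repeats for any number of layers, which is why depth adds nothing unless a nonlinear activation sits between the layers.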
Sigmoid functions are nonlinear functions, which makes them very useful: they can be used for problems with multiple classes, and multiple layers of neurons can be stacked as required. A sigmoid function is a bounded, differentiable real function that is defined for all real input values and has a positive derivative at each point. Commonly used sigmoid functions are the logistic function and the hyperbolic tangent. Sigmoid functions are preferred in back-propagation neural networks because they reduce complications during the training phase; in particular, their derivatives are easy to compute.
In the graph of the sigmoid function, the curve is very steep for X values between -2 and 2. This means that any small change in X in that region causes Y to change significantly, so the sigmoid function tends to push the Y values towards either end of the curve. This property makes it useful for predicting whether data values belong to a specific class.
Another advantage of the sigmoid activation function is that, unlike the linear function, its output always lies in the range (0, 1) or (-1, 1), compared to (-∞, ∞) for a linear function. So our activations are bounded.
Towards either end of the sigmoid function, the Y values respond very little to changes in X, so the gradient in those regions is small. This gives rise to the problem of vanishing gradients: the gradient is so small that it cannot produce a significant change in the weights. The network then stops learning or learns drastically more slowly. The sigmoid function is nevertheless very popular in classification problems, and a few workarounds are available to address the vanishing gradient problem.
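The steep middle and the flat ends can be seen numerically. The sketch below assumes the logistic form of the sigmoid, 1 / (1 + e^(-x)), whose derivative can be written as s(x) * (1 - s(x)); the function names are chosen for this example only.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, with outputs in the open range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_gradient(x))
```

The gradient is at its largest, 0.25, at x = 0 and is already below 0.0001 at x = 10. Because back-propagation multiplies such factors together layer by layer, a few saturated neurons are enough to make the overall update very small.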
Binary Sigmoid Function
A sigmoid activation function with a range between 0 and 1 is used in neural networks where the output values are either binary or vary from 0 to 1. It is also called the binary sigmoid or logistic sigmoid.
Bipolar Sigmoid Function
A logistic sigmoid function can be scaled to any range of values that is appropriate for a problem. The most common range is from -1 to 1. This is called the bipolar sigmoid activation function.
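One common way to do this rescaling, shown here as a sketch rather than the only possible definition, is to stretch the logistic output from (0, 1) to (-1, 1) with 2 * s(x) - 1. With this choice the bipolar sigmoid coincides with tanh(x / 2), which the loop below checks numerically. The function names are assumptions made for the example.

```python
import math

def binary_sigmoid(x):
    """Logistic sigmoid, with outputs in the open range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """Logistic sigmoid rescaled to the open range (-1, 1)."""
    return 2.0 * binary_sigmoid(x) - 1.0

# Compare the rescaled logistic with tanh(x / 2) at a few sample points.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, binary_sigmoid(x), bipolar_sigmoid(x), math.tanh(x / 2.0))
```

The bipolar form is centered on zero, so negative inputs give negative outputs instead of values just above 0.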
We will discuss a few more activation functions in our next article.