# Activation Function in Deep Learning

## Introduction

Before starting with activation functions in deep learning, you should go through my previous blog on Deep Learning: https://ainewgeneration.com/introduction-to-deep-learning/. In this blog we will focus on activation functions, the variety of types used in deep learning, and their importance in neural networks. The activation function introduces something called non-linearity to a network and also determines whether a particular neuron can contribute to the next layer. Now let's jump into the details and the types used in deep learning.

## Table of Content

- What is Activation Function?
- Types of Activation Function
  - Step Function
  - Linear Function
  - Sigmoid Function
  - Tanh Function
  - ReLU Function (Rectified Linear Unit)
  - Leaky ReLU Function
- Which Activation Function to Use?
- Why Non-Linear Activation Functions?

## What is Activation Function ?

The activation function introduces non-linearity to a network, and also determines whether a particular neuron can contribute to the next layer. **But how do you decide if the neuron can fire/activate or not?** Well, a couple of ideas led to the creation of different activation functions. We will discuss the various types of activation functions in detail below.

## Types of Activation Function

- Step Function
- Linear Function
- Sigmoid Function
- Tanh Function
- ReLU Function (Rectified Linear Unit)
- Leaky ReLU Function

### Step Function

**Activate** the neuron if its input is above a certain value or threshold; if it is below the threshold, **don't activate** it. In other words, the neuron is activated if Y is greater than some threshold, and not activated otherwise. This is a step function: it outputs 1 when the input is greater than the threshold (here, 0), and it outputs 0, i.e. stays deactivated, when the input is below the threshold.
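As a minimal sketch of this behavior (using NumPy, with an assumed threshold of 0):

```python
import numpy as np

def step(x, threshold=0.0):
    """Binary step: output 1 when the input exceeds the threshold, else 0."""
    return np.where(x > threshold, 1, 0)

print(step(np.array([-2.0, 0.0, 3.0])))  # -> [0 0 1]
```

Note that the output jumps straight from 0 to 1 with no values in between, which is exactly the binary behavior discussed next.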

**Drawbacks:**

The main drawback of a step function shows up when, as in the image below, you want to classify inputs into multiple classes: class one, class two, class three, and so on. If more than one neuron is activated, all of those neurons will output 1, so how do we decide which class the input belongs to? It becomes really complicated to classify, because you would want the network to activate only one neuron while the others stay at 0; only then can we say the input was classified properly.

### Linear Function

A linear function is a straight-line function where the activation is proportional to the input, scaled by a value called the slope of the line. This gives us a range of activations rather than a binary activation, so we can connect a few neurons together, and if more than one fires we can take the maximum value and decide based on that.

**Drawbacks:**

If you are familiar with gradient descent, you'll notice that the derivative of a linear function is constant. This makes sense, because its slope does not change at any point.

This means that the gradient has no relationship with x. It also means that during back-propagation, the adjustments made to the weights and biases aren't dependent on x at all, which is a real problem at training time, when weights are adjusted through backpropagation. Consider a fully connected network with a number of hidden layers: if every layer uses a linear activation function, then the final fully connected layer is itself just some linear function of the input. This means the entire network of dozens of layers can be replaced by a single layer, since a composition of linear functions is still another linear function. And this is terrible, because we've just lost the ability to stack layers: no matter how many we stack, the whole network is still equivalent to a single layer with a single activation function.
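This collapse can be verified numerically. The sketch below (using NumPy with arbitrary random weights) shows that two stacked linear layers with no activation in between compute exactly the same function as one linear layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # an arbitrary input vector
W1 = rng.normal(size=(4, 3))           # first "layer" weights
W2 = rng.normal(size=(2, 4))           # second "layer" weights

# Two stacked linear layers, no non-linearity in between...
two_layers = W2 @ (W1 @ x)
# ...are exactly equivalent to one linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```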

### Sigmoid Activation

The sigmoid function is defined as **f(x) = 1 / (1 + e^(-x))**. It looks smooth, somewhat like a softened step function, and it is non-linear in nature; a combination of non-linear functions is also non-linear, so now we can stack layers. **What about non-binary activation?** Yes, we get that too: unlike the step function, this function outputs an analog activation and has a smooth gradient. An advantage of this activation function over the linear function is that its output is bounded to the range **(0, 1)**. It is one of the most widely used activation functions.
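The definition above translates directly into code. A minimal sketch with NumPy:

```python
import numpy as np

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))    # 0.5 -- exactly halfway between the two bounds
print(sigmoid(10.0))   # very close to 1, but never reaches it
print(sigmoid(-10.0))  # very close to 0, but never reaches it
```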

**Drawbacks:**

If you look closely at the sigmoid figure above between X = -2 and X = 2, the curve is very steep: any small change in the value of X in that region causes the Y values to change drastically. Towards the ends of the function, however, the Y values respond very little to changes in X. The gradient in those regions is really, really small, almost zero, and it gives rise to the **vanishing gradient problem**. Put simply, if the input to the activation function is either very large or very small, the sigmoid squishes it down to a value near one or zero, and the gradient of the function becomes really small, which is a huge problem.
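The vanishing gradient is easy to see numerically. The sigmoid's derivative is f'(x) = f(x)·(1 - f(x)), which peaks at 0.25 at x = 0 and collapses towards zero for large |x|; a small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Near x = 0 the gradient is at its largest (0.25);
# far from 0 it practically vanishes.
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~0.000045 -- almost zero
```

During backpropagation these tiny gradients get multiplied layer by layer, which is why deep stacks of sigmoids train so slowly.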

### Tanh Function

The Tanh function looks very similar to the sigmoid; in fact, mathematically it is a scaled and shifted sigmoid. Like the sigmoid, it has the characteristics we discussed above: it is non-linear in nature, so we can stack layers. Its output is bounded to the range {-1 to +1}, and the derivative of the Tanh function is steeper than that of the sigmoid. So deciding between Tanh and sigmoid really depends on your gradient requirements. Like the sigmoid, Tanh is also a very popular and widely used activation function.
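The "scaled and shifted sigmoid" relationship can be checked directly, since tanh(x) = 2·sigmoid(2x) - 1. A quick sketch with NumPy:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Outputs are bounded to (-1, 1), centered at 0:
print(np.tanh(x))

# tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```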

**Drawbacks:**

The Tanh activation function also has the vanishing gradient problem, just like the sigmoid.

### ReLU Function (Rectified Linear Unit)

The Rectified Linear Unit (ReLU) function is defined as **R(z) = max{0, z}**. At first glance it looks like a linear function, since the graph is linear on the positive axis, but ReLU is in fact non-linear in nature, and a combination of ReLUs is also non-linear, which means we can stack layers. In the two previous functions the range was bounded, but ReLU is not bounded: its range runs from **zero to infinity**, which means there is a chance of the activation blowing up.
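A minimal NumPy sketch of R(z) = max{0, z}, also showing the unbounded positive side:

```python
import numpy as np

def relu(z):
    """R(z) = max(0, z): passes positives through, zeroes out negatives."""
    return np.maximum(0, z)

print(relu(np.array([-3.0, 0.0, 5.0])))  # [0. 0. 5.]

# Unlike sigmoid/tanh, the positive side is not squashed at all:
print(relu(np.array([1e6])))  # [1000000.]
```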

**Sparsity of activation:** imagine a big neural network with lots of neurons. Using sigmoid or Tanh causes almost all the neurons to fire in an analog way, which means almost all activations are processed to describe the network's output; in other words, the activations are dense, and this is costly. Ideally we want only a few neurons in the network to activate, making the activation pass efficient. This is where ReLU comes in: in a network with randomly initialized weights, almost 50% of the neurons yield zero activation, because ReLU outputs 0 for negative values of x. This means only about 50% of the neurons will fire, and this sparse activation makes the network lighter.
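The "almost 50%" claim can be simulated. Assuming zero-mean random pre-activations (as you would roughly get from randomly initialized weights on centered inputs), about half of them land below zero, and ReLU silences all of those:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated pre-activations of 100,000 neurons with zero-mean
# randomly initialized weights and inputs.
pre_activations = rng.normal(size=100_000)
activations = np.maximum(0, pre_activations)  # ReLU

sparsity = np.mean(activations == 0)
print(f"fraction of silent neurons: {sparsity:.2f}")  # ~0.50
```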

Because of the horizontal line in ReLU for negative values of X, the gradient is zero in that region, which means that during backpropagation the weights will not get adjusted during descent. Neurons that go into that state stop responding to variations in the error: the gradient is zero, so nothing changes. This is called the dying ReLU problem, and it can cause several neurons to simply die and stop responding, making a substantial part of the network passive rather than producing the output we want.

### Leaky ReLU Function

We turn the horizontal line into a non-horizontal component by giving the negative side a small slope, usually a small value such as 0.01. This new version of ReLU is called **Leaky ReLU**.
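A minimal NumPy sketch of Leaky ReLU (using an assumed slope of 0.01 for the negative side):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of zero,
    so the gradient on the negative side is never exactly zero."""
    return np.where(z > 0, z, slope * z)

print(leaky_relu(np.array([-3.0, 0.0, 5.0])))  # [-0.03  0.    5.  ]
```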

The main idea is that the gradient should never be zero. One major advantage of ReLU is that it is less computationally expensive than functions like Tanh and sigmoid, because it involves only simple mathematical operations. This is a really good point to consider when you are designing your own deep neural network.

## Which Activation Function to Use ?

The **sigmoid activation function** is widely used in the output layer for binary classification problems, since its output ranges between 0 and 1: a value above the threshold is assigned to class A, and a value below the threshold is assigned to class B.
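As a sketch of that thresholding step (the class names and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def classify(logits, threshold=0.5):
    """Squash raw outputs through a sigmoid, then threshold into two classes."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return np.where(probs > threshold, "class A", "class B")

print(classify(np.array([2.0, -1.5])))  # ['class A' 'class B']
```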

The **ReLU activation function** is widely used in the hidden layers of a neural network, as it deactivates neurons with negative values and activates/fires neurons whose value is above 0.

## Why Non-Linear Activation functions ?

As we saw in the introduction, activation functions serve to introduce non-linearity into the network. For all intents and purposes, introducing non-linearity simply means that your activation function must be non-linear, that is, not a straight line. Mathematically, linear functions are polynomials of degree 1; when graphed in the x-y plane they are straight lines inclined to the x-axis at a certain value, which we call the slope of the line. Non-linear functions are polynomials of degree greater than 1, and when graphed they don't form a straight line but rather a curve.

If we use linear activation functions to model our data, then no matter how many hidden layers our network has, it will always be equivalent to a single-layer network. In deep learning we want to be able to model every type of data without being restricted, as would be the case if we used linear activation functions.

## End Notes

I hope this blog helps you build a complete understanding of activation functions in deep learning. In the next blog we will focus on loss functions & optimizers in deep learning.