Types of Activation Functions
Activation functions play a crucial role in neural networks by determining the output of a neuron. They introduce non-linearity into the model, allowing the network to learn complex patterns. Below are some of the most commonly used activation functions in neural networks:
Sigmoid Activation Function
The sigmoid function is one of the oldest and most well-known activation functions. It maps input values to an output between 0 and 1, making it especially useful for binary classification problems.
sigmoid(x) = 1 / (1 + exp(-x))
- Range: (0, 1)
- Pros: Easy to understand and implement, good for probability outputs.
- Cons: Can lead to vanishing gradients, which slows down training in deep networks; its outputs are also not zero-centered.
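To make this concrete, here is a minimal NumPy sketch of the sigmoid function (the function name and example values are chosen here purely for illustration):

import numpy as np

def sigmoid(x):
    # Squash any real-valued input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # approx. [0.119, 0.5, 0.881]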
Hyperbolic Tangent (Tanh) Activation Function
The tanh function is similar to sigmoid but outputs values between -1 and 1. It is often preferred over sigmoid for hidden layers in neural networks because it is zero-centered, making optimization more efficient.
tanh(x) = (2 / (1 + exp(-2x))) - 1
- Range: (-1, 1)
- Pros: Zero-centered, which helps in faster convergence compared to sigmoid.
- Cons: Like sigmoid, it can also suffer from vanishing gradients.
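For comparison, a minimal sketch of tanh written directly from the formula above (in practice you would simply call np.tanh):

import numpy as np

def tanh(x):
    # Written out to mirror the formula above; equivalent to np.tanh(x)
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

print(tanh(np.array([-1.0, 0.0, 1.0])))  # approx. [-0.762, 0.0, 0.762]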
Rectified Linear Unit (ReLU)
The ReLU function is widely used due to its simplicity and effectiveness, especially in deep neural networks. It outputs the input directly if it’s positive, and zero otherwise. This non-linearity helps the model learn complex patterns efficiently.
ReLU(x) = max(0, x)
- Range: [0, ∞)
- Pros: Efficient, reduces the likelihood of vanishing gradients, and speeds up training.
- Cons: Can suffer from the “dying ReLU” problem, where a neuron that only ever outputs zero receives zero gradient and stops learning.
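A minimal NumPy sketch of ReLU (again, names and example values are illustrative):

import numpy as np

def relu(x):
    # Positive inputs pass through unchanged; negatives are clamped to zero
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 3.0])))  # [0.0, 0.0, 3.0]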
Leaky Rectified Linear Unit (Leaky ReLU)
The Leaky ReLU is a variant of ReLU that produces a small, non-zero output for inputs less than zero. This helps prevent the “dying ReLU” problem, since a small gradient still flows even when the input is negative.
Leaky ReLU(x) = max(αx, x), where α is a small constant
- Range: (-∞, ∞)
- Pros: Helps mitigate the dying ReLU problem by allowing small gradients when x is negative.
- Cons: α is a hyperparameter that must be chosen manually, and a poorly chosen value can still lead to suboptimal performance.
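A minimal sketch of Leaky ReLU; α = 0.01 is a common default, but the best value is problem-dependent:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # For 0 < alpha < 1 this is identical to max(alpha * x, x)
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-5.0, 2.0])))  # [-0.05, 2.0]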
Parametric Rectified Linear Unit (PReLU)
The PReLU is another variation of ReLU in which the negative slope α is learned during training instead of being fixed as in Leaky ReLU, making it more flexible and able to adapt to the data.
PReLU(x) = max(αx, x), where α is learned during training
- Range: (-∞, ∞)
- Pros: It is adaptive and allows the network to learn the best value of α for each neuron.
- Cons: Adds extra learnable parameters, making it slightly more computationally expensive than other ReLU variants.
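A sketch of the PReLU forward pass; in a real framework α would be a learnable parameter updated by backpropagation, but it is shown here as a plain value for illustration:

import numpy as np

def prelu(x, alpha):
    # Same form as Leaky ReLU, except alpha is learned rather than fixed
    return np.where(x > 0, x, alpha * x)

alpha = 0.25  # a common initial value; learned (often per channel) during training
print(prelu(np.array([-4.0, 4.0]), alpha))  # [-1.0, 4.0]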
Softmax Activation Function
The Softmax function is commonly used in the output layer of a neural network for multi-class classification problems. It converts raw scores (logits) into probabilities by taking the exponentials of each input and normalizing them.
Softmax(x_i) = exp(x_i) / sum(exp(x_j)) for all j
- Range: (0, 1) for each output, and the sum of all outputs equals 1.
- Pros: Converts outputs into probabilities, making it ideal for multi-class classification.
- Cons: Numerically sensitive to extreme logit values (the exponentials can overflow without stabilization), and a single very large logit can push the output close to a one-hot vector.
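A minimal, numerically stabilized sketch of softmax; subtracting the maximum logit before exponentiating avoids overflow without changing the result:

import numpy as np

def softmax(logits):
    # Shift by the max logit for numerical stability (the output is unchanged)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # approx. [0.659, 0.242, 0.099], sums to 1.0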
Swish Activation Function
The Swish function is a newer activation function introduced by researchers at Google. It is a self-gated function, multiplying the input by its own sigmoid, and has been shown to outperform ReLU in some scenarios.
Swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
- Range: approximately [-0.28, ∞); unlike ReLU, it allows small negative outputs.
- Pros: Smooth, non-monotonic function that can help improve model performance in certain deep learning tasks.
- Cons: Requires more computation compared to ReLU.
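A minimal sketch of Swish, written as x times the sigmoid of x:

import numpy as np

def swish(x):
    # x * sigmoid(x): large positives pass through, small negatives are damped
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-2.0, 0.0, 2.0])))  # approx. [-0.238, 0.0, 1.762]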
Activation functions are essential in helping neural networks model complex patterns, and the choice of activation function can significantly affect model performance. ReLU and its variants are the usual default for hidden layers in deep networks, while sigmoid and softmax are typically reserved for output layers in binary and multi-class classification, respectively.