Last Updated on 2024-06-06 by Clay
Introduction
The SiLU (Sigmoid Linear Unit) activation function is similar to the Swish function; Swish simply has an additional trainable beta parameter. Many large language models (LLMs) also adopt this activation, especially models that explore activation functions other than ReLU, such as the classic Llama architecture.
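For example, the Llama configuration shipped with Hugging Face Transformers defaults to "silu" as its hidden activation; a quick check (assuming the transformers package is installed) looks like this:

from transformers import LlamaConfig

# The Llama architecture uses SiLU ("silu") as its default hidden activation.
config = LlamaConfig()
print(config.hidden_act)  # silu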
The formulas are shown below, and the difference is quite obvious.

SiLU(x) = x * sigmoid(x)
Swish(x) = x * sigmoid(β * x)
Note: the sigmoid function is sigmoid(x) = 1 / (1 + exp(-x)).
The SiLU function approaches x as the input increases and approaches 0 as the input decreases. Intuitively, the curve of SiLU is very smooth: its output changes continuously with the input and it is differentiable everywhere, which works well with gradient descent because the derivative never changes abruptly.
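As a rough sketch of that point, the closed-form derivative of SiLU, sigmoid(x) * (1 + x * (1 - sigmoid(x))), can be compared against the gradient PyTorch's autograd computes; the helper name silu_grad_closed_form below is just for illustration.

import torch

def silu_grad_closed_form(x: torch.Tensor) -> torch.Tensor:
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

x = torch.linspace(-10, 10, 100, requires_grad=True)
y = x * torch.sigmoid(x)  # SiLU applied elementwise
y.sum().backward()        # autograd fills x.grad with dSiLU/dx for each element

print(torch.allclose(x.grad, silu_grad_closed_form(x.detach())))  # True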
Example
import torch
import matplotlib.pyplot as plt
# Define the SiLU activation function
def my_silu(x: torch.Tensor) -> torch.Tensor:
return x * torch.sigmoid(x)
# Generate an input tensor from -10 to 10 with 100 points
x_values = torch.linspace(-10, 10, 100)
# Apply the SiLU function to the input tensor
y_values = my_silu(x_values)
# Plotting the function
plt.figure(figsize=(10, 5))
plt.plot(x_values.numpy(), y_values.numpy(), label="SiLU Activation")
plt.title("SiLU Activation Function")
plt.xlabel("Input value (x)")
plt.ylabel("Activated value (y)")
plt.legend()
plt.grid(True)
plt.show()
Output:
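In practice you would normally call the built-in implementation rather than a hand-written one; a minimal sanity check that my_silu matches torch.nn.functional.silu (or the torch.nn.SiLU module) might look like this:

import torch
import torch.nn.functional as F

x = torch.linspace(-10, 10, 100)
# The hand-written version and the built-in op should agree numerically.
print(torch.allclose(x * torch.sigmoid(x), F.silu(x)))  # True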
References
- SiLU — PyTorch 2.3 documentation
- SiLU Explained
- Llama — Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/main/en/model_doc/llama