
[Machine Learning] Note Of SiLU Activation Function

Last Updated on 2024-06-06 by Clay

Introduction

The SiLU (Sigmoid Linear Unit) activation function is similar to the Swish function; Swish simply has an additional trainable beta parameter. Many large language models (LLMs) also adopt this activation, primarily in models that use activation functions other than ReLU, such as the classic Llama architecture.

The formulas are shown below; the difference between them is obvious:

SiLU(x) = x · σ(x)
Swish(x) = x · σ(βx)

Note: the Sigmoid function is:

σ(x) = 1 / (1 + e^(−x))
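To make the relationship concrete, here is a minimal sketch of Swish with a trainable beta (my own example; the class name Swish and the default beta value are assumptions, not from the original post). With beta fixed at 1, it reduces exactly to SiLU:

import torch
import torch.nn as nn

class Swish(nn.Module):
    def __init__(self, beta: float = 1.0):
        super().__init__()
        # beta is a learnable parameter; an optimizer can update it during training
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# With beta = 1 (and no training), Swish is exactly SiLU
x = torch.linspace(-5, 5, 11)
swish = Swish()
print(torch.allclose(swish(x), x * torch.sigmoid(x)))  # True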
The SiLU function approaches x as the input x increases, and approaches 0 as the input decreases. Intuitively, the curve of the SiLU function is very smooth: its output changes continuously with the input, and it is differentiable everywhere, which works well with gradient descent algorithms since the derivative never changes abruptly.
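As a small check of that claim (this snippet is my own addition), we can compare the gradient computed by autograd with the analytic derivative σ(x) + x · σ(x) · (1 − σ(x)) at a few points:

import torch

def my_silu(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)

# Probe the derivative at a few points with autograd
x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0], requires_grad=True)
my_silu(x).sum().backward()

# Analytic derivative: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
s = torch.sigmoid(x.detach())
analytic = s + x.detach() * s * (1 - s)

print(torch.allclose(x.grad, analytic))  # True
print(x.grad)  # values vary gradually; no jump like ReLU's derivative at 0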


Example

import torch
import matplotlib.pyplot as plt

# Define the SiLU activation function
def my_silu(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)

# Generate an input tensor from -10 to 10 with 100 points
x_values = torch.linspace(-10, 10, 100)

# Apply the SiLU function to the input tensor
y_values = my_silu(x_values)

# Plotting the function
plt.figure(figsize=(10, 5))
plt.plot(x_values.numpy(), y_values.numpy(), label="SiLU Activation")
plt.title("SiLU Activation Function")
plt.xlabel("Input value (x)")
plt.ylabel("Activated value (y)")
plt.legend()
plt.grid(True)
plt.show()


Output: a plot of a smooth curve that stays near 0 for negative inputs and approaches y = x for positive inputs.


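For reference, PyTorch also ships a built-in SiLU; a quick sanity check (my own addition, not part of the original post) confirms it matches the hand-written version:

import torch
import torch.nn.functional as F

def my_silu(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)

x = torch.linspace(-10, 10, 100)

# torch.nn.functional.silu and torch.nn.SiLU are PyTorch's built-in SiLU
print(torch.allclose(my_silu(x), F.silu(x)))           # True
print(torch.allclose(my_silu(x), torch.nn.SiLU()(x)))  # True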