Equation:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Self-Attention Calculation:
Code Example:
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Scale by sqrt(d_k) so the dot products stay in a range where softmax has useful gradients
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(
        torch.tensor(d_k, dtype=torch.float32)
    )
    # Normalize scores over the key dimension to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(attention_weights, V)
```
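A quick sanity check of the function above (the tensor shapes here are illustrative, not from the original text):

```python
# Batch of 2 sequences, 5 tokens each, d_k = 64
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # torch.Size([2, 5, 64])
```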
Equation:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O $$

where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ uses its own learned projection matrices.
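A minimal from-scratch sketch of how the projections, per-head attention, and the final Concat · W^O fit together. The class name and layer sizes below are illustrative assumptions, and the code reuses the `scaled_dot_product_attention` function defined above:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # Sketch: d_model is split evenly across num_heads heads (d_k = d_model / num_heads)
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Learned projections W^Q, W^K, W^V and the output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        batch, seq, _ = x.shape
        return x.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        # Each head attends independently; matmul broadcasts over the head dimension
        heads = scaled_dot_product_attention(Q, K, V)
        # Concat(head_1, ..., head_h): merge the head dimension back into d_model
        batch, _, seq, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(batch, seq, -1)
        return self.W_o(concat)
```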
Equation:

$$ PE_{(pos, 2i)} = \sin \left(\frac{pos}{10000^{2i/d_{model}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos \left(\frac{pos}{10000^{2i/d_{model}}}\right) $$
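A small sketch of how these sinusoidal encodings can be computed in PyTorch; the function name and `max_len` parameter are illustrative choices, assuming an even `d_model`:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```

The resulting matrix is added to the token embeddings so that each position carries a distinct, deterministic signal.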

Component | Description |
---|---|
Encoder | Encodes the input sequence |
Decoder | Generates the output sequence |
Multi-Head Attention | Captures multiple aspects of relationships |
Feed-Forward Network | Applies non-linearity |
Positional Encoding | Adds positional information to embeddings |
Code Example:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead):
        super(TransformerBlock, self).__init__()
        # Multi-head self-attention over the input sequence
        self.attention = nn.MultiheadAttention(d_model, nhead)
        # Position-wise feed-forward network with a wider inner layer
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2048),
            nn.ReLU(),
            nn.Linear(2048, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model) -- nn.MultiheadAttention expects sequence-first input by default
        attn_output, _ = self.attention(x, x, x)
        # Residual connection followed by layer normalization
        x = self.norm1(x + attn_output)
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x
```
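A quick usage check of the block (shapes chosen for illustration; note the sequence-first layout expected by `nn.MultiheadAttention` with default settings):

```python
import torch

block = TransformerBlock(d_model=512, nhead=8)
x = torch.randn(10, 2, 512)   # (seq_len=10, batch=2, d_model=512)
out = block(x)
print(out.shape)              # torch.Size([10, 2, 512])
```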