Attention January 15, 2024

Self-Attention Mechanism

Visual walkthrough of single-headed and multi-headed self-attention — understanding the matrix dimensions and operations.

#attention #transformers #deep-learning

Single-Headed Attention

Single-headed attention mechanism diagram.

Ignore the softmax and the scaling step (division by the square root of the key dimension d_k, which equals d_model when a single head keeps the full model width): neither operation changes the dimensions of the matrices involved.
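To make the dimension bookkeeping concrete, here is a minimal pure-Python sketch of one attention head. The sizes (`seq_len`, `d_model`, `d_k`, `d_v`), the random weights, and the helper names are all assumptions chosen for illustration; they are not taken from the diagram.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax; the output has the same shape as the input."""
    out = []
    for row in a:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(a):
    return (len(a), len(a[0]))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Illustrative sizes (assumptions): one head that keeps the full model width.
seq_len, d_model, d_k, d_v = 4, 8, 8, 8
rng = random.Random(0)

X = rand_matrix(seq_len, d_model, rng)   # (seq_len, d_model)
W_q = rand_matrix(d_model, d_k, rng)     # (d_model, d_k)
W_k = rand_matrix(d_model, d_k, rng)     # (d_model, d_k)
W_v = rand_matrix(d_model, d_v, rng)     # (d_model, d_v)

Q = matmul(X, W_q)                       # (seq_len, d_k)
K = matmul(X, W_k)                       # (seq_len, d_k)
V = matmul(X, W_v)                       # (seq_len, d_v)

scores = matmul(Q, transpose(K))         # (seq_len, seq_len)
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # shape unchanged
weights = softmax_rows(scaled)           # shape unchanged
output = matmul(weights, V)              # (seq_len, d_v)

for name, m in [("X", X), ("Q", Q), ("K", K), ("V", V),
                ("scores", scores), ("weights", weights), ("output", output)]:
    print(f"{name:8s} {shape(m)}")
```

The scaling and the softmax both act element-wise or row-wise on the `(seq_len, seq_len)` score matrix, which is why they can be left out when only the shapes matter.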

Multi-Headed Attention

Multi-headed attention mechanism diagram.

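As in the single-headed case, ignore the softmax and the scaling step. Each head divides its scores by the square root of its own key dimension d_k (typically d_model divided by the number of heads), not by the square root of d_model, and neither operation changes the dimensions of the matrices involved.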
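The same shapes can be traced for several heads. The sketch below is a pure-Python illustration under stated assumptions: two heads, `d_k = d_v = d_model // num_heads`, separate per-head projection matrices, and a final output projection named `W_o`. These names and sizes are hypothetical and may differ from the layout in the diagram.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Row-wise softmax; the output has the same shape as the input."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shape(a):
    return (len(a), len(a[0]))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attention_head(X, W_q, W_k, W_v):
    """One head: (seq_len, d_model) in, (seq_len, d_v) out."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(W_q[0])                           # per-head key dimension
    K_t = [list(col) for col in zip(*K)]        # (d_k, seq_len)
    scores = matmul(Q, K_t)                     # (seq_len, seq_len)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), V)      # (seq_len, d_v)

# Illustrative sizes (assumptions).
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_v = d_model // num_heads                # 4 per head
rng = random.Random(0)

X = rand_matrix(seq_len, d_model, rng)          # (seq_len, d_model)

heads = []
for _ in range(num_heads):
    W_q = rand_matrix(d_model, d_k, rng)        # (d_model, d_k)
    W_k = rand_matrix(d_model, d_k, rng)        # (d_model, d_k)
    W_v = rand_matrix(d_model, d_v, rng)        # (d_model, d_v)
    heads.append(attention_head(X, W_q, W_k, W_v))

# Concatenate the head outputs along the feature axis.
concat = [[x for head in heads for x in head[i]] for i in range(seq_len)]

W_o = rand_matrix(num_heads * d_v, d_model, rng)  # (num_heads * d_v, d_model)
output = matmul(concat, W_o)                      # (seq_len, d_model)

print("per head:", shape(heads[0]))
print("concat:  ", shape(concat))
print("output:  ", shape(output))
```

With these sizes each head produces a `(seq_len, d_v)` block, the concatenation restores the width to `num_heads * d_v = d_model`, and the output projection keeps it there, so the layer's output has the same shape as its input.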