Slide 1 — Tokens in the Prompt

Consider the input sequence "The cat sat on the mat". A transformer first converts each token into a vector representation. Attention then determines which other tokens each token should focus on.
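The token-to-vector step can be sketched as a simple table lookup. This is a minimal illustration with a hypothetical 3-dimensional embedding table (real models use learned, much higher-dimensional embeddings); the vocabulary and random values are assumptions, not the model's actual parameters.

```python
import numpy as np

# Toy vocabulary for the example sentence; "The" and "the" are distinct tokens.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
vocab = {tok: i for i, tok in enumerate(tokens)}

# Hypothetical embedding table: one random 3-d vector per vocabulary entry.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 3))

# Convert the token sequence into a matrix of vectors, one row per token.
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)  # (6, 3): six tokens, each represented by a 3-d vector
```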

Slide 2 — Query, Key, and Value Vectors

For each token, the model computes three vectors: query (Q), key (K), and value (V). The query of one token is compared with the keys of other tokens to decide how much attention to pay to them.
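A sketch of how Q, K, and V are derived: each is a linear projection of the same token embeddings. The embeddings and projection matrices below are hypothetical random stand-ins, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))  # stand-in embeddings for the six tokens

# Three learned projection matrices (random here for illustration).
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))

Q = X @ W_q  # queries: what each token is looking for
K = X @ W_k  # keys:    what each token offers for matching
V = X @ W_v  # values:  what each token contributes if attended to
```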

Slide 3 — Compute Attention Scores for "cat"

The query vector for "cat" is compared with every key vector. Larger similarity means stronger attention. After normalization, this becomes a set of attention weights.
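The score-and-normalize step above can be written as a scaled dot product followed by a softmax. Q and K here are hypothetical stand-ins, with row 1 playing the role of "cat"; the scaling by the square root of the key dimension follows standard scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.normal(size=(6, 3))  # stand-in query vectors
K = rng.normal(size=(6, 3))  # stand-in key vectors
d_k = K.shape[-1]

# Similarity of "cat"'s query (row 1) with every key, scaled by sqrt(d_k).
scores = Q[1] @ K.T / np.sqrt(d_k)

# Numerically stable softmax turns scores into attention weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights.sum())  # the weights form a probability distribution
```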

Slide 4 — Dense Attention Output

In standard dense attention, "cat" attends to every token in the sequence, including itself. The output vector is the weighted sum of all value vectors.

Dense weights used:
0.08·V(The) + 0.18·V(cat) + 0.27·V(sat) + 0.10·V(on) + 0.07·V(the) + 0.30·V(mat)
Output_dense("cat"):
[0.638, 0.430, 0.304]
Interpretation:
The output keeps some information from all tokens, including "on" and the second "the".
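The dense output above is just a weighted sum. The sketch below uses the slide's attention weights, but the value vectors are hypothetical stand-ins, so the resulting numbers differ from the slide's [0.638, 0.430, 0.304].

```python
import numpy as np

# Attention weights for "cat" from the slide; they sum to 1.
weights = np.array([0.08, 0.18, 0.27, 0.10, 0.07, 0.30])

# Hypothetical 3-d value vectors, one per token (illustrative only).
rng = np.random.default_rng(3)
V = rng.normal(size=(6, 3))

# Dense attention output: weighted sum over all six value vectors.
output_dense = weights @ V
```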

Slide 5 — Sparse / Efficient Attention Output

In an efficient attention mechanism, "cat" may only attend to a restricted set of tokens. Here, we keep only "The", "cat", "sat", and "mat", then renormalize the remaining weights so they again sum to 1.

Sparse weights used:
0.14·V(The) + 0.21·V(cat) + 0.31·V(sat) + 0.34·V(mat)
Output_sparse("cat"):
[0.706, 0.412, 0.317]
Interpretation:
The output is more concentrated on "sat" and "mat", but it drops the contributions from "on" and the second "the".
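The sparsification step can be sketched as masking followed by renormalization. The dense weights below are the slide's; note that the exact sparse weights depend on how the pattern is applied (for example, re-running softmax over the restricted raw scores), so the renormalized values here need not match the slide's figures exactly.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
dense_w = np.array([0.08, 0.18, 0.27, 0.10, 0.07, 0.30])

# Restricted attention set: "cat" may only attend to these four tokens.
keep = np.array([t in {"The", "cat", "sat", "mat"} for t in tokens])

# Zero out the masked tokens, then renormalize over the kept ones.
sparse_w = np.where(keep, dense_w, 0.0)
sparse_w /= sparse_w.sum()
print(dict(zip(tokens, sparse_w.round(2))))
```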

Slide 6 — Compare Dense vs Sparse Output Vectors

The two attention mechanisms produce similar but not identical contextualized vectors. The difference comes from which value vectors were allowed to contribute.
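The comparison can be made concrete with the two output vectors from the slides: their element-wise difference is small and their directions nearly agree.

```python
import numpy as np

# Contextualized vectors for "cat" from slides 4 and 5.
out_dense = np.array([0.638, 0.430, 0.304])
out_sparse = np.array([0.706, 0.412, 0.317])

# Element-wise difference: which components the sparse pattern shifted.
diff = out_sparse - out_dense
print(np.round(diff, 3))  # [ 0.068 -0.018  0.013]

# Cosine similarity: close to 1, so the directions nearly agree.
cosine = out_dense @ out_sparse / (
    np.linalg.norm(out_dense) * np.linalg.norm(out_sparse)
)
print(round(float(cosine), 3))  # 0.998
```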