Attention Slideshow

Slide 1 — Tokens in the Prompt

Consider the input sequence "The cat sat on the mat". A transformer first converts each token into a vector representation. Attention then determines which other tokens each token should focus on.

Slide 2 — Query, Key, and Value Vectors

For each token, the model computes three vectors: query (Q), key (K), and value (V). The query of one token is compared with the keys of other tokens to decide how much attention to pay to them.

Token

Query (Q)

Key (K)

Value (V)

The

[0.2, 0.1, 0.0]

[0.1, 0.3, 0.1]

[0.2, 0.1, 0.4]

cat

[0.8, 0.4, 0.2]

[0.7, 0.5, 0.1]

[0.9, 0.2, 0.3]

sat

[0.6, 0.5, 0.1]

[0.8, 0.4, 0.2]

[0.7, 0.6, 0.2]

on

[0.1, 0.7, 0.3]

[0.2, 0.8, 0.2]

[0.1, 0.6, 0.3]

the

[0.2, 0.2, 0.1]

[0.2, 0.3, 0.2]

[0.3, 0.2, 0.2]

mat

[0.5, 0.6, 0.2]

[0.6, 0.7, 0.1]

[0.8, 0.5, 0.4]

These numbers are illustrative. They show the structure of the computation, not values from a trained model.

Slide 3 — Compute Attention Scores for "cat"

The query vector for "cat" is compared with every key vector. Larger similarity means stronger attention. After normalization, this becomes a set of attention weights.

Dense attention weights from "cat"

cat → The

0.08

cat → cat

0.18

cat → sat

0.27

cat → on

0.10

cat → the

0.07

cat → mat

0.30

In this example, "cat" pays the most attention to "sat" and "mat".

Slide 4 — Dense Attention Output

In standard dense attention, "cat" can compare itself with every token. The output vector is the weighted sum of all value vectors.

Dense weights used

0.08·V(The) + 0.18·V(cat) + 0.27·V(sat) + 0.10·V(on) + 0.07·V(the) + 0.30·V(mat)

Output_dense(cat)

[0.638, 0.430, 0.304]

Interpretation

The output keeps some information from all tokens, including "on" and the second "the".

Slide 5 — Sparse / Efficient Attention Output

In an efficient attention mechanism, "cat" may only attend to a restricted set of tokens. Here, we keep only The, cat, sat, and mat, then renormalize the weights.

Sparse weights used

0.14·V(The) + 0.21·V(cat) + 0.31·V(sat) + 0.34·V(mat)

Output_sparse(cat)

[0.706, 0.412, 0.317]

Interpretation

The output is more concentrated on "sat" and "mat", but it drops the contributions from "on" and the second "the".

Slide 6 — Compare Dense vs Sparse Output Vectors

The two attention mechanisms produce similar but not identical contextualized vectors. The difference comes from which value vectors were allowed to contribute.

Dense attention

[0.638, 0.430, 0.304]

Sparse attention

[0.706, 0.412, 0.317]

Difference (sparse − dense)

[+0.068, −0.018, +0.013]

Dimension

Dense

Sparse

Change

1

0.638

0.706

+0.068

2

0.430

0.412

−0.018

3

0.304

0.317

+0.013

Takeaway

Dense attention preserves weaker contributions from every token, while sparse attention keeps only selected interactions. That makes the sparse output cheaper to compute, but it can shift the final contextual representation.