Consider the input sequence "The cat sat on the mat". A transformer first converts each token into a vector representation. Attention then determines which other tokens each token should focus on.
We will follow the token "cat" and show how it attends to the rest of the sequence.
For each token, the model computes three vectors: query (Q), key (K), and value (V). The query of one token is compared with the keys of other tokens to decide how much attention to pay to them.
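This projection step can be sketched in a few lines of NumPy. The embeddings and weight matrices below are random stand-ins (as the next paragraph notes, illustrative rather than trained), and the dimensions are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_head = 8, 4  # illustrative sizes, not from any real model

# One embedding vector per token (random placeholders).
X = rng.normal(size=(len(tokens), d_model))

# Learned projection matrices in a real model; random here.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

# Every token gets its own query, key, and value vector.
Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each of shape (6, 4)
```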
These numbers are illustrative. They show the structure of the computation, not values from a trained model.
The query vector for "cat" is compared with every key vector, typically via dot products. A larger dot product means stronger attention. After a softmax normalization, these scores become a set of attention weights that sum to one.
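The comparison and normalization can be sketched as scaled dot-product attention for the single token "cat" (index 1). The query and key values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_head = 4
Q = rng.normal(size=(len(tokens), d_head))  # placeholder queries
K = rng.normal(size=(len(tokens), d_head))  # placeholder keys

# Compare the query of "cat" with every key, scaled by sqrt(d_head).
scores = Q[1] @ K.T / np.sqrt(d_head)

# Softmax turns the scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
```

The max-subtraction inside the softmax is a standard trick for numerical stability; it does not change the resulting weights.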
In this example, "cat" pays the most attention to "sat" and "mat".
In standard dense attention, "cat" can compare itself with every token. The output vector is the weighted sum of all value vectors.
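The dense output is a single matrix-vector product: every value vector contributes, scaled by its attention weight. The weights and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head = 6, 4
V = rng.normal(size=(n_tokens, d_head))  # placeholder value vectors

# Illustrative dense attention weights for "cat" (sum to 1).
weights = np.array([0.05, 0.10, 0.40, 0.05, 0.05, 0.35])

# Dense attention: weighted sum over ALL value vectors.
output = weights @ V  # shape (4,)
```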
In an efficient attention mechanism, "cat" may only attend to a restricted set of tokens. Here, we keep only "The", "cat", "sat", and "mat", then renormalize the weights.
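Masking and renormalization can be written as a zero-out followed by a rescale. The dense weights below are the same illustrative numbers, and the kept set is the one described above:

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Illustrative dense attention weights for "cat".
weights = np.array([0.05, 0.10, 0.40, 0.05, 0.05, 0.35])

# Keep only {The, cat, sat, mat}; "on" and the second "the" are dropped.
keep = np.array([t in {"The", "cat", "sat", "mat"} for t in tokens])
sparse = np.where(keep, weights, 0.0)

# Renormalize so the surviving weights sum to 1 again.
sparse /= sparse.sum()
```

After renormalization, the surviving weights grow slightly: each is divided by the kept mass (0.90 here), so "sat" rises from 0.40 to about 0.44.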
The two attention mechanisms produce similar but not identical contextualized vectors. The difference comes from which value vectors were allowed to contribute.
Dense attention preserves weaker contributions from every token, while sparse attention keeps only selected interactions. That makes the sparse output cheaper to compute, but it can shift the final contextual representation.
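The shift described above can be made concrete by computing both outputs side by side. With the same illustrative weights and random placeholder values, the gap between the two contextualized vectors is exactly the contribution of the dropped tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
V = rng.normal(size=(len(tokens), 4))  # placeholder value vectors

# Illustrative dense weights and the sparse, renormalized version.
dense_w = np.array([0.05, 0.10, 0.40, 0.05, 0.05, 0.35])
keep = np.array([t in {"The", "cat", "sat", "mat"} for t in tokens])
sparse_w = np.where(keep, dense_w, 0.0)
sparse_w /= sparse_w.sum()

dense_out = dense_w @ V   # every token contributes
sparse_out = sparse_w @ V # only the kept tokens contribute

# Similar but not identical: the gap reflects the dropped interactions.
gap = np.linalg.norm(dense_out - sparse_out)
```

Because the dropped weights were small (0.05 each), the gap is modest; dropping tokens that carried large weights would shift the representation much more.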