What Enables In-Context Learning In Transformers?

An Analysis of Attention Mechanisms

Overview

Large language models (LLMs) often demonstrate in-context learning (ICL): the ability to infer a task from examples provided in the prompt, without updating model parameters. Recent theoretical work suggests that attention mechanisms may implicitly implement optimization procedures such as gradient descent when solving simple tasks like linear regression.

In this project, we investigate whether this behavior depends on the specific structure of the attention mechanism. We compare several attention variants, including standard softmax attention, linear attention, sparse attention, grouped-query attention, low-rank attention, and gated attention, under the same experimental setup.

Our goal is to understand the tradeoff between efficiency and learning capability in transformer architectures.


Introduction

What is In-Context Learning?

In-context learning is the ability of transformer-based LLMs to learn an input-output mapping from examples provided within the prompt.

For example, suppose a prompt contains several input-output examples of a function:

x₁ → y₁
x₂ → y₂
x₃ → y₃

followed by a new input: x*

The model must predict the corresponding output: y*

Despite never seeing this exact task before, large transformer models can often infer the relationship between inputs and outputs from the examples and apply it to the query. This behavior is surprising because the model is not updating its weights. Instead, it appears to adapt purely through the computation performed during the forward pass. In other words, the model is effectively learning at inference time from the prompt itself.

What is Attention?

Attention is the core mechanism that allows transformers to determine which parts of an input sequence are relevant when computing a representation for each token.

Instead of processing tokens strictly from left to right, attention allows each token in the sequence to look at other tokens and determine how relevant they are.

In practice, attention works by computing similarity scores between tokens using three learned projections of each token: queries, keys, and values.

The similarity between queries and keys determines how strongly information from one token should influence another. The final representation of a token is therefore a weighted combination of the values of other tokens in the sequence.

This mechanism is particularly important for ICL, because it allows the model to read example input–output pairs in the prompt and combine them to infer the relationship needed to solve the query task.
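The query-key-value computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with made-up dimensions, not the project's actual model code:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # learned projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # query-key similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted combination of values

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Each output row mixes information from every token in the sequence, which is exactly what lets the model relate example pairs in a prompt.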

Problem Statement

Previous research has shown that transformers are able to perform in-context learning because of their attention mechanism. At a high level, attention allows transformers to selectively relate information across tokens, which is a key part of how they form useful contextual representations.

However, one of the main drawbacks of standard attention is that its cost grows quadratically with input length. For example, a prompt of 100 tokens requires 100 × 100 = 10,000 “attention connections.” This was one of the main limitations of early LLMs: the size of the prompt you could give them was capped.

To address this, researchers developed more efficient attention mechanisms whose cost does not grow quadratically with input size. Although these mechanisms are more efficient, with fewer “attention connections,” they are not always as powerful.

This raises the question: do these more efficient attention mechanisms still support the same level of in-context learning as standard attention? And what are the tradeoffs of using them?

Why this matters: Efficient attention can reduce compute and memory costs, but it may also change how well a model learns from examples in the prompt. Our goal is to understand that tradeoff.

Target Users & Stakeholders


Research Question

Our central research question is:

How does the structure of the attention mechanism affect a transformer’s ability to perform in-context learning?

Specifically, we investigate whether more efficient attention variants retain the in-context learning ability of standard softmax attention.

To investigate this question, we compare several alternative attention mechanisms under the same experimental setup.


Methods

Attention Variants

We compare six attention mechanisms in the same transformer setup:

Standard (Softmax baseline)

Softmax: Yes. Global context: Yes. Memory cost: High. Main idea: Full attention.

Baseline transformer attention with unrestricted token-to-token interaction.

GQA (Shared KV heads)

Softmax: Yes. Global context: Yes. Memory cost: Medium. Main idea: Shared KV.

Reduces cost by letting multiple query heads share key and value projections.

Sparse (Restricted connectivity)

Softmax: Usually yes. Global context: Limited. Memory cost: Low. Main idea: Fewer links.

Only a subset of token pairs interact, improving efficiency but reducing coverage.

Linear (No softmax)

Softmax: No. Global context: Yes. Memory cost: Low. Main idea: Linear update.

Removes softmax and is closely connected to one-step gradient descent interpretations of ICL.

Low-Rank (Compressed attention)

Softmax: Yes. Global context: Compressed. Memory cost: Low. Main idea: Projection.

Approximates full attention using a lower-dimensional summary of keys and values.

Gated (Learned update control)

Softmax: No. Global context: Yes. Memory cost: Low. Main idea: Selective update.

Uses a learned gate to control how strongly new information changes the representation.

More concretely, the following table summarizes the main idea behind each mechanism and why we include it:

Variant    Main idea                 Why include it?
Standard   Full softmax attention    Baseline
GQA        Shared KV heads           Efficiency
Sparse     Limited connectivity      Scalability
Linear     No softmax                GD connection
Low-Rank   Compressed attention      Approximation
Gated      Learned update control    Flexible dynamics
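To make the structural difference that matters most for efficiency concrete, here is a minimal sketch contrasting standard softmax attention with linear attention. Shapes and names are illustrative; real linear-attention variants typically add feature maps and normalization:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes an (n, n) score matrix, O(n^2 d) time."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V):
    """Linear attention: drop the softmax and reassociate the product.

    Without the softmax, (Q K^T) V == Q (K^T V), and K^T V is a small
    (d, d) summary, so cost scales linearly with sequence length n.
    """
    return Q @ (K.T @ V)

rng = np.random.default_rng(0)
n, d = 100, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (100, 16)
```

The reassociation is the whole trick: the softmax couples every query to every key, while removing it lets the key-value summary be computed once and reused by all queries.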

Experiments

Four controlled sweeps are run on synthetic in‑context linear regression to isolate how attention structure shapes in‑context learning. All attention variants are evaluated under a matched training setup and the same evaluation metrics.

Common setup

1) Training‑Steps Sweep (Learning Curve)

2) Layers Sweep (Depth Scaling)

3) Context Sweep (Trained)

4) Context Sweep (Zero‑Train)

Data Generation

Each task samples a fresh ground‑truth linear map W*, draws inputs x from a zero‑mean distribution, and defines outputs y = W* x. A prompt is constructed from several context pairs (x_i, y_i) plus a query input x*. The query token is represented as (x*, 0), and the model must predict y*.

A new regression task is sampled for every training example, preventing memorization and forcing learning from the prompt itself. The setup remains simple enough to compare against analytic baselines (least squares and one‑step GD) while still capturing the core in‑context learning structure.
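The sampling procedure described above can be sketched as follows. Function and variable names are ours, not the project's exact code:

```python
import numpy as np

def sample_task(n_context, d_in, d_out, rng):
    """Sample one synthetic in-context linear regression prompt."""
    W = rng.normal(size=(d_out, d_in))       # fresh ground-truth map W*
    X = rng.normal(size=(n_context, d_in))   # zero-mean context inputs
    Y = X @ W.T                              # targets y_i = W* x_i
    x_query = rng.normal(size=(d_in,))
    y_query = W @ x_query                    # what the model must predict
    # The query token pairs x* with a zeroed-out output slot.
    query_token = np.concatenate([x_query, np.zeros(d_out)])
    return X, Y, query_token, y_query

rng = np.random.default_rng(0)
X, Y, q, y_star = sample_task(n_context=8, d_in=4, d_out=2, rng=rng)
print(X.shape, Y.shape, q.shape, y_star.shape)  # (8, 4) (8, 2) (6,) (2,)
```

Because W* is resampled for every prompt, the only way to predict y* is to infer the linear map from the context pairs themselves.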


Results


Key Takeaways


Impact and Implications

Project Scope & Limitations

Within this project, we compare standard, grouped-query, sparse, linear, low-rank, and gated attention mechanisms under the same experimental setup.

Each model is trained and evaluated on synthetic linear regression tasks, which allows us to directly compare model behavior with classical optimization methods such as least squares, LASSO, and gradient descent.
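For reference, the two analytic baselines used for comparison can be sketched as follows. The learning rate and the zero initialization for the one-step gradient descent baseline are our illustrative choices:

```python
import numpy as np

def least_squares_predict(X, Y, x_query):
    """Fit W by ordinary least squares on the context pairs, then predict."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X @ W ≈ Y
    return x_query @ W

def one_step_gd_predict(X, Y, x_query, lr=0.1):
    """One gradient step from W = 0 on the squared loss over the context.

    The gradient of 0.5 * ||X W - Y||^2 at W = 0 is -X^T Y,
    so a single step gives W = lr * X^T Y.
    """
    W = lr * X.T @ Y
    return x_query @ W

rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 2))
X = rng.normal(size=(32, 4))
Y = X @ W_true                     # noiseless context targets
x_q = rng.normal(size=(4,))
print(least_squares_predict(X, Y, x_q))  # ≈ x_q @ W_true on noiseless data
```

On noiseless overdetermined data, least squares recovers W* exactly, so it serves as an upper bound on what any in-context learner can achieve in this setup.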

However, this setup also introduces several limitations.

First, our experiments are conducted on synthetic tasks rather than natural language data, meaning our results may not directly transfer to large-scale language modeling settings.

Second, our analysis is primarily empirical. While we measure prediction accuracy and alignment with gradient descent updates, we do not analyze model parameters or derive formal theoretical guarantees explaining the observed behaviors.

Despite these limitations, this controlled setup allows us to isolate the role of the attention mechanism and study how architectural differences influence in-context learning behavior.

Our results are purely experimental; we do not analyze model parameters or offer mathematical arguments to justify our observations.

Applications and Practical Impact

In‑context learning enables models to adapt at inference time without retraining. This is essential for few‑shot classification, rapid domain adaptation, retrieval‑augmented QA, tool use, code completion with evolving APIs, and personalized assistants that pick up user preferences from short prompts.

Efficient attention variants matter because these applications benefit most from longer prompts, more examples, more documents, and richer context. If an attention mechanism preserves ICL while reducing memory and compute, it expands the usable context window and makes prompt‑driven behavior more scalable and cost‑effective.


Team