Multi-Document Context Windows in Legal AI, Explained
Key Takeaways
- Bigger context windows (1M+ tokens) help on some legal AI tasks but don't solve the fundamental multi-document reasoning problem.
- The hard problem is not "fit everything in context" — it's "reason about ordering, dependencies, and supersession across documents."
- Amendment chains are the canonical case where context-window size doesn't help; sequential reasoning structure does.
- Structured extraction with explicit add/modify/delete tracking beats throw-everything-in-context.
This is a technical piece on why bigger context windows don't solve multi-document reasoning in legal AI, and what does. Written for engineers building legal AI internally and for attorneys who keep getting told by vendors "our long context window handles your data room."
The marketing claim
A common pitch in 2026: "Our model has a 1M+ token context window, so we can fit your entire data room in context and answer questions about it." The implication: context size obviates the need for architectural complexity.
Partially true. Mostly false.
What's true: bigger context windows do help on some tasks. Holistic analysis of a single long agreement (200-page master services agreement, complex financing instrument) becomes possible without chunking. System prompts, examples, and firm-voice context can stay in window without paginating.
What's false: bigger context windows don't solve the fundamental multi-document reasoning problem. Specifically, they don't solve amendment chain resolution, supersession reasoning, or cross-document consistency.
Why context size doesn't solve amendment chains
Throw a multi-amendment MSA (original + 8 amendments, ~500 pages total) into a 1M-token context. Ask: "what is the current operative termination provision?"
The model sees all of it. It also sees the original termination provision, the first amendment's modification of that provision, the third amendment's further modification, the fifth amendment's complete rewrite, the seventh amendment's dollar-figure adjustment to a referenced cap. Every version is in the haystack.
The model's job is to figure out which version is current. The architectural problem: the model has no native concept of "currently operative". It has access to all the text. It picks one version based on whichever signals dominate (textual recency in context, lexical similarity to the question, prompt phrasing). The pick may be right or wrong; the model is confident either way.
This is the failure mode we cover in Amendment Chain Resolution: The Hardest Problem in Legal AI. Bigger context windows make it worse, not better, because more text in context means more possible conflicting answers without structural disambiguation.
What works instead
The architectural fix is structured extraction with explicit sequencing.
For amendment chains specifically:
- Process the original agreement. Extract the termination provision (and every other relevant provision). Tag it as the v1 state.
- Process amendment 1 in context with the v1 state. Identify whether amendment 1 modifies, supersedes, or leaves intact each tracked provision. Update the tracked state to v2.
- Process amendment 2 in context with the v2 state. Same operation. Produce v3.
- Repeat through amendment N. The final state is the operative current view.
- Pass the resolved current state to downstream reasoning. Memo drafting, schedule synthesis, redline review all see the resolved state, not the raw amendment stack.
The key insight: the work of "what's currently operative" is sequential and dependency-aware. Each amendment's effect is conditional on what came before. A model that sees all of it at once cannot reason correctly about the dependency structure; a model that processes the chain step-by-step with explicit state tracking can.
Other multi-document patterns big context can't solve
Amendment chains are the most common case. There are others:
Cross-referenced provisions. Contract A references "the indemnity package as set forth in Section 8 of the Master Services Agreement". The MSA is a separate document. Resolving the cross-reference requires identifying the right MSA in the data room (there may be multiple), pulling Section 8, and integrating it with Contract A's terms. Context-stuffing produces plausible-but-wrong answers when multiple MSAs are in the haystack.
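One way to make the disambiguation explicit is to resolve the cross-reference over structured metadata rather than raw text. A sketch, assuming an upstream step has already tagged each document with a type, its parties, and a section map (all field names here are illustrative):

```python
def resolve_cross_reference(contract: dict, data_room: list[dict],
                            section: str = "8") -> str:
    """Find the one MSA this contract's cross-reference points at."""
    candidates = [d for d in data_room if d["doc_type"] == "MSA"]
    # Disambiguate by party overlap, not lexical similarity to the question.
    matches = [d for d in candidates
               if set(d["parties"]) == set(contract["parties"])]
    if len(matches) != 1:
        # Surface ambiguity instead of guessing -- the context-stuffing
        # failure mode is silently picking the wrong MSA.
        raise LookupError(f"{len(matches)} candidate MSAs "
                          f"for parties {contract['parties']}")
    return matches[0]["sections"][section]

# Toy data room with two MSAs for different party pairs.
data_room = [
    {"doc_type": "MSA", "parties": ["Acme", "Beta"],
     "sections": {"8": "Indemnity terms for Acme/Beta."}},
    {"doc_type": "MSA", "parties": ["Acme", "Gamma"],
     "sections": {"8": "Different indemnity terms."}},
    {"doc_type": "NDA", "parties": ["Acme", "Beta"], "sections": {}},
]
contract_a = {"parties": ["Beta", "Acme"]}
section_8 = resolve_cross_reference(contract_a, data_room)
```

The design choice worth noting: when the match is not unique, the system raises rather than answers, which is exactly what a model with both MSAs in its context window cannot do on its own.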
Schedule-to-agreement linkage. Disclosure schedules reference specific provisions of the underlying SPA's reps. Reasoning about whether a schedule item is a complete answer to the rep requires structured handling of the rep-to-schedule mapping. Context size doesn't help.
Multi-document consistency. "Are all 47 customer contracts consistent on change-of-control?" This is a quantitative question. Context-stuffing 47 contracts and asking the model to count change-of-control variants produces answers that look plausible and miscount.
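The structural fix for counting questions is to split the work: one structured classification per contract (a model call, stubbed here as an already-populated field), then ordinary code for the tally, which cannot miscount. A sketch with hypothetical field names:

```python
from collections import Counter

def coc_variant(contract: dict) -> str:
    """Per-contract change-of-control classification.

    In a real pipeline this value comes from a single structured
    extraction call per document, not from one giant context.
    """
    return contract["change_of_control"]

contracts = [
    {"id": 1, "change_of_control": "consent_required"},
    {"id": 2, "change_of_control": "notice_only"},
    {"id": 3, "change_of_control": "consent_required"},
]

# Deterministic counting and outlier detection over the classifications.
tally = Counter(coc_variant(c) for c in contracts)
majority = tally.most_common(1)[0][0]
outliers = [c["id"] for c in contracts if coc_variant(c) != majority]
```

With 47 (or 1,500) contracts the shape is identical: the model's job shrinks to one classification per document, and the quantitative answer is computed, not generated.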
Long-tail prioritization. "Of these 1,500 documents, which 10 deserve the partner's attention?" Quality of prioritization depends on how the system has classified, scored, and ranked. The model can read the documents in context; the prioritization quality is in the chassis around the model.
The right framing
Big context windows are a tool. They help on some tasks. They don't make the architectural complexity disappear.
The right framing for engineers and attorneys evaluating tools: ask not "what's your context window?" but "how does your system handle multi-document reasoning?" The answers are very different and the second question is the one that matters.
Companion reading
- Amendment Chain Resolution: The Hardest Problem in Legal AI — the full deep-dive on the canonical case
- How to Architect a Document AI Pipeline for Legal — six-layer architecture overview
- LLM Hallucination in Contract Analysis — how the validation layer catches the cases where reasoning fails
- The F1 Engine Problem — the broader chassis-vs-engine framing
— Raffi
Frequently Asked Questions
Doesn't a 1M-token context window solve everything?
No. It solves the problem of fitting more text in. It doesn't solve the problem of reasoning correctly about that text. Throwing 1,000 contracts into context and asking the model to find the change-of-control clauses produces output that looks plausible and contains errors that scale with the document count. The architecture that catches errors is what produces correct output.
What's the actual hard problem?
Reasoning about ordering, dependencies, and supersession across documents. Specifically for amendment chains: which version of a provision is currently operative? Neither RAG nor direct context-stuffing has native handling for this. We covered the architectural fix in [Amendment Chain Resolution: The Hardest Problem in Legal AI](/blog/amendment-chain-resolution-hardest-problem-legal-ai).
When are big context windows useful?
Single-document long-form reasoning (analyzing a 200-page agreement holistically), pulling related provisions across a small set of documents (3-5), keeping the system prompt and the firm's voice in context. They're not the answer for amendment chains, large data rooms, or multi-document supersession reasoning.
What does Mage actually do?
Structured extraction with explicit sequencing. We process amendments in order, track add/modify/delete operations at the provision level, and produce a resolved current-state view. The result goes into context for downstream reasoning. The structuring is what makes the downstream reasoning correct.