Technical Deep Dive

Our Accuracy Methodology: How Mage Validates Extraction Quality

Raffi Isanians, CEO & Co-founder
December 10, 2025 · 7 min read

Before Mage, I spent years at Google working on Document AI, where I learned firsthand that document extraction is one of the hardest problems in machine learning. The gap between "impressive demo" and "production-ready accuracy" is enormous. Today, I want to pull back the curtain on how we approach accuracy at Mage, and why we believe our methodology produces results that attorneys can actually rely on.

The Dirty Secret of AI Document Extraction

The document intelligence industry has a credibility problem. Vendors routinely claim "99% accuracy," but that number usually refers to character-level OCR on clean, printed documents. The metric that actually matters for legal work is field-level accuracy: did the AI correctly extract the complete indemnification cap, including all its conditions and exceptions?

In 2025, accuracy for enterprise document extraction on complex, unstructured documents typically hovers between 85% and 95% without human review. For legal documents, with their nested clauses, cross-references, and context-dependent interpretations, the challenge is even greater.

At Mage, we refuse to ship "AI slop." Our attorneys deserve extractions they can trust, not outputs that require constant second-guessing. That commitment to quality drove us to develop a fundamentally different approach.

Principle 1: Paragraph-Level Processing

The first breakthrough came from understanding a well-documented phenomenon in AI research called the "Lost in the Middle" problem. When you feed an LLM a long document, its ability to accurately retrieve and reason about information degrades significantly. Models exhibit a U-shaped performance curve: they handle the beginning and end of a context window well, but accuracy drops by 20% or more for information buried in the middle.

This has profound implications for legal document analysis. If you dump an entire 50-page contract into an LLM and ask it to extract the termination provisions, you are statistically likely to get incomplete or inaccurate results. The relevant clauses might be on page 23, right in the model's blind spot.

Our solution: we process documents paragraph by paragraph.

Instead of overwhelming the model with an entire document, we break contracts into their constituent parts and analyze each section with focused attention. This approach keeps context windows small and manageable, ensuring the AI can give each paragraph its full reasoning capacity. When we need to understand cross-references or related clauses, we intelligently pull in only the relevant context.
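To make the idea concrete, here is a minimal sketch of paragraph-level extraction. The function names and the generic `ask_model` callable are illustrative stand-ins for an LLM client, not our production code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Extraction:
    paragraph_id: int
    field: str
    value: Optional[str]

def split_into_paragraphs(document: str) -> list[str]:
    """Break a contract into its constituent paragraphs (blank-line delimited here)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def extract_field(
    document: str,
    field: str,
    ask_model: Callable[[str, str], Optional[str]],  # (field, context) -> extracted value
) -> list[Extraction]:
    """Run one focused extraction per paragraph instead of one pass over the whole document."""
    results = []
    for i, paragraph in enumerate(split_into_paragraphs(document)):
        # The model only ever sees this paragraph, so the target text is never
        # buried in the middle of a long context window.
        value = ask_model(field, paragraph)
        results.append(Extraction(paragraph_id=i, field=field, value=value))
    return results
```

A real pipeline also bundles in any sections a paragraph cross-references, but the key property is the same: the model only ever reasons over a small, focused slice of the contract.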

The result is dramatically better accuracy. By never asking the model to search through dozens of pages, we eliminate the "lost in the middle" failure mode entirely. Every extraction happens with the relevant text front and center in the model's attention.

Principle 2: Multi-Model Consensus

The second pillar of our accuracy methodology is something we call multi-model consensus. Instead of trusting a single AI model to get the right answer, we run multiple models simultaneously and compare their outputs.

This approach is grounded in a simple insight: different AI models have different failure modes. GPT-4o might hallucinate a date that Claude 3.5 Sonnet extracts correctly. A fine-tuned extraction model might miss nuance that a general-purpose model catches. By requiring agreement across multiple model families, we filter out the idiosyncratic errors that any single model would make.

Here is how it works in practice:

  1. Parallel Extraction: We send the same paragraph to multiple AI models from different providers (OpenAI, Anthropic, and others).
  2. Agreement Scoring: We calculate an agreement score based on whether the models extracted the same information.
  3. Consensus Resolution: When models agree, we have high confidence in the extraction. When they disagree, we flag the discrepancy for additional review.
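As a rough sketch of those three steps, simplified and using placeholder model callables rather than any specific provider SDK (this is not our production consensus algorithm):

```python
from collections import Counter
from typing import Callable, Optional

ModelFn = Callable[[str, str], Optional[str]]  # (field, paragraph) -> extracted value

def normalize(value: Optional[str]) -> Optional[str]:
    """Light normalization so trivial formatting differences don't count as disagreement."""
    return " ".join(value.lower().split()) if value else None

def consensus_extract(field: str, paragraph: str, models: dict[str, ModelFn]) -> dict:
    # 1. Parallel extraction: every model answers the same question about the same
    #    paragraph (run concurrently in production; sequential here for brevity).
    answers = {name: normalize(fn(field, paragraph)) for name, fn in models.items()}

    # 2. Agreement scoring: what fraction of models gave the most common answer?
    top_value, top_count = Counter(answers.values()).most_common(1)[0]
    agreement = top_count / len(models)

    # 3. Consensus resolution: accept unanimous extractions, flag anything else.
    return {
        "value": top_value,
        "agreement": agreement,
        "needs_review": agreement < 1.0,
        "per_model": answers,
    }
```

In production, normalization and agreement scoring are far more involved, since dates, monetary amounts, and clause language each need their own comparison logic, but the shape is the same: agreement across independent models, not any single model's self-reported confidence, drives the confidence signal.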

Research in 2025 shows that multi-model consensus can reduce hallucinations by 40-60% and boost extraction accuracy by 4-6 percentage points. More importantly, it provides a reliable signal for confidence: if three distinct model families agree on a fact, the probability of hallucination drops to near zero.

This is fundamentally different from relying on a single model's "confidence score," which is notoriously unreliable. Models will hallucinate with 99% confidence. Agreement across independent models is a far more trustworthy signal.

Principle 3: Quality Over Speed

Running multiple models on every paragraph is computationally expensive. Processing documents paragraph-by-paragraph takes longer than dumping everything into a single prompt. We made these tradeoffs deliberately.

In M&A diligence, the cost of an error dwarfs the cost of compute. Missing a change-of-control provision or misreading an indemnification cap can have consequences measured in millions of dollars. Our attorneys need to trust what they are seeing, not treat every AI output as a rough draft that requires manual verification.

This is what we mean when we say we prioritize quality over "AI slop." The industry is full of products that optimize for impressive-looking outputs and fast response times. We optimize for accuracy. Our extraction pipeline takes the time to get it right, because that is what legal work demands.

Lessons from Document AI

My time at Google Document AI taught me that the hardest problems in document extraction are not the ones you expect. Character-level OCR is largely solved. The real challenges are:

  • Contextual understanding: Knowing that a "Termination Date" field in one section refers to a different concept than "Effective Date of Termination" in another
  • Complex tables: Handling merged cells, spanning headers, and tables that break across pages
  • The long tail: A system might achieve 99% accuracy on common document formats, but stumble on the unusual edge cases that matter most in legal review

These are exactly the problems where our paragraph-level, multi-model approach shines. By processing documents at a granular level with multiple independent reasoners, we catch the subtle errors that would slip through a more conventional pipeline.

The Road Ahead

We are not done improving. Our accuracy methodology continues to evolve as we learn from every document our attorneys process. We are investing in specialized models fine-tuned on legal language, better handling of cross-references and defined terms, and more sophisticated consensus algorithms.

But the core principles remain constant: keep context windows focused, require multi-model agreement, and never compromise on quality. These are the foundations that let us deliver AI extractions attorneys can actually trust.

In an industry awash with AI hype, we believe the path forward is rigorous methodology, not marketing claims. That is how we are building Mage, and that is how we intend to earn the trust of the legal profession.

See our accuracy methodology in action

Schedule a demo to see how Mage extracts data you can trust.
