How We Test Legal AI Accuracy: Mage's Benchmarking Methodology
Key Takeaways
- Accuracy claims without disclosed methodology are meaningless: the specific documents tested, the extraction tasks measured, and the evaluation criteria all determine whether an accuracy number reflects real-world performance
- Mage benchmarks against expert human reviewers on the same document sets, measuring both extraction completeness and classification accuracy across document types that attorneys encounter on live deals
- Confidence scoring separates high-certainty extractions from uncertain ones, allowing attorneys to focus human review time on the outputs most likely to need correction
- Continuous regression testing against a growing corpus of annotated documents ensures that model improvements in one area do not degrade performance in another
Legal AI accuracy testing is the process of systematically measuring how well an AI system performs against defined standards on real legal documents. It is the foundation of trust in any AI-assisted review system. And it is where most vendors in the legal AI space fall short, because meaningful accuracy measurement is hard, expensive, and often produces numbers that are less impressive than marketing-friendly claims.
Why Most Accuracy Claims Are Unreliable
When a legal AI vendor claims "95% accuracy," the natural question is: 95% of what, on what documents, measured how?
The answer matters enormously. An accuracy claim tested on clean, well-structured contracts with standard language will produce a different number than one tested on scanned, multi-amendment lease agreements with handwritten riders. An extraction task that identifies whether a contract contains an indemnification clause produces a different accuracy number than one that extracts the specific indemnification cap amount and identifies all the carve-outs.
There are three problems with how accuracy is typically reported in legal AI:
Cherry-picked test sets. If you test only on document types where your system performs well, you get high numbers. If you test on the messy, ambiguous, multi-amendment documents that attorneys actually encounter, the numbers come down. Meaningful benchmarks must include the full distribution of document quality.
Vague task definitions. "Extracting the indemnification provision" could mean identifying that an indemnification section exists, extracting the complete text of the provision, or identifying the cap, basket, carve-outs, survival period, and notice requirements separately. Each represents a different level of difficulty and a different accuracy number.
No human baseline. An AI system that achieves 90% accuracy sounds impressive until you learn that an experienced attorney achieves 92% on the same task. Or it sounds unimpressive until you learn that the same experienced attorney achieves 85% after reviewing their 200th document in a week.
How Mage Approaches Accuracy Measurement
Our benchmarking methodology is built on three principles: representative test sets, defined extraction standards, and human baselines.
Representative Test Sets
Our test corpus includes documents sourced from real M&A data rooms across multiple industries, company sizes, and document quality levels. The corpus includes:
- Clean, well-structured contracts (word-processed, standard formatting)
- Scanned documents with OCR artifacts and inconsistent formatting
- Multi-amendment chains where the current terms require reading across the original agreement and multiple amendments
- Non-standard formats including letters, emails memorializing agreements, and handwritten modifications
- Multi-jurisdictional documents with varying legal terminology and conventions
We do not exclude difficult documents from our test sets. If attorneys encounter a document type on live deals, it belongs in the benchmark.
Defined Extraction Standards
For every document type and clause category we support, we maintain a detailed extraction specification that defines:
- What constitutes a correct extraction (exact match, semantic equivalence, or structured match)
- What constitutes a complete extraction (all relevant provisions identified versus only the primary instance)
- How multi-source answers are handled (provisions that span multiple sections or documents)
- How ambiguous provisions are scored (when reasonable attorneys would disagree on interpretation)
This specification is the contract between the engineering team and the accuracy measurement process. When we improve a model, we measure against the same specification. When we expand to new document types, we define the specification before we begin testing.
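One way to picture such a specification is as a small structured record per clause category. The sketch below is illustrative only; the class and field names (`ExtractionSpec`, `MatchType`, and the policy strings) are hypothetical and not Mage's actual internal schema.

```python
from dataclasses import dataclass
from enum import Enum

class MatchType(Enum):
    EXACT = "exact"            # output must match the gold span verbatim
    SEMANTIC = "semantic"      # paraphrase acceptable if meaning is preserved
    STRUCTURED = "structured"  # parsed fields (amounts, parties, dates) must match

@dataclass
class ExtractionSpec:
    clause_category: str
    match_type: MatchType
    require_all_instances: bool  # complete extraction vs. primary instance only
    multi_source_allowed: bool   # answer may span multiple sections or documents
    ambiguity_policy: str        # how disagreement cases are scored

# Hypothetical spec for indemnification caps: structured match, every
# instance required, cross-references allowed, ambiguity tracked separately.
cap_spec = ExtractionSpec(
    clause_category="indemnification_cap",
    match_type=MatchType.STRUCTURED,
    require_all_instances=True,
    multi_source_allowed=True,
    ambiguity_policy="track_separately",
)
```

Keeping the specification in a machine-readable form like this is what makes it enforceable: the same record can drive both the scorer and the documentation.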
Human Baselines
Every test set is annotated by experienced M&A attorneys who establish the ground truth. We use multiple annotators and measure inter-annotator agreement to identify provisions where reasonable attorneys disagree. These disagreement cases are tracked separately because they represent the ceiling of achievable accuracy. No AI system should be penalized for uncertainty that experienced humans share.
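A standard way to quantify inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below shows the calculation for two annotators labelling the same provisions; the labels are invented for illustration.

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Two hypothetical annotators classifying four provisions.
kappa = cohens_kappa(
    ["cap", "cap", "no_cap", "cap"],
    ["cap", "no_cap", "no_cap", "cap"],
)
# kappa = 0.5: moderate agreement, despite 75% raw agreement.
```

Provisions with low kappa are exactly the "reasonable attorneys disagree" cases the methodology tracks separately.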
We then compare the system's performance against the human annotations using standard information retrieval metrics:
- Precision: Of the provisions the system extracted, how many were correct?
- Recall: Of the provisions that should have been extracted, how many did the system find?
- F1 score: The harmonic mean of precision and recall, providing a single metric that balances both dimensions
We report these metrics at the document type and clause category level, not as a single aggregate number. An aggregate number obscures the distribution of performance across different task difficulties.
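The metric definitions above reduce to a few lines of arithmetic once predicted and gold extractions are represented as sets. This is a minimal sketch; the section identifiers are invented for illustration.

```python
def precision_recall_f1(
    predicted: set[str], gold: set[str]
) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one document type or clause category.

    `predicted` holds the provision spans the system extracted;
    `gold` holds the spans the attorney annotators marked as ground truth.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Example: the system found 4 provisions, 3 of which match the 5 annotated.
p, r, f = precision_recall_f1(
    {"s2.1", "s4.3", "s7.2", "s9.1"},
    {"s2.1", "s4.3", "s7.2", "s8.4", "s8.5"},
)
# p = 0.75, r = 0.6
```

Reporting the triple per category, rather than averaging across categories, is what keeps hard tasks from hiding behind easy ones.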
Confidence Scoring
Not all extractions carry the same level of certainty. A clearly stated indemnification cap of "$5,000,000" in a well-structured agreement is a different extraction challenge than a cap that is defined by cross-reference to another section that itself references a schedule.
Mage assigns a confidence score to every extraction. This score reflects:
- Language clarity: How unambiguous is the provision's language?
- Source consistency: Do multiple sections of the document support the same answer?
- Pattern familiarity: How similar is this provision to patterns the system has seen before?
- Amendment impact: Has the provision been modified by subsequent amendments?
Attorneys use confidence scores to prioritize their review. High-confidence extractions can be spot-checked. Low-confidence extractions get full attorney attention. This is a fundamentally different workflow than reviewing every extraction uniformly.
The practical result is that attorneys spend their time where it matters most: on the ambiguous provisions, the unusual language, and the multi-amendment chains where human judgment is genuinely needed.
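The routing workflow described above can be sketched as a simple split on a confidence threshold. The class, field names, and the 0.9 cutoff are illustrative assumptions, not Mage's actual API.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    clause: str
    value: str
    confidence: float  # 0.0-1.0, combining the signals listed above

def route_for_review(
    extractions: list[Extraction], threshold: float = 0.9
) -> tuple[list[Extraction], list[Extraction]]:
    """Split extractions into a spot-check queue and a full-review queue."""
    spot_check = [e for e in extractions if e.confidence >= threshold]
    full_review = [e for e in extractions if e.confidence < threshold]
    return spot_check, full_review

# A clearly stated cap routes to spot-check; an ambiguous basket does not.
spot, full = route_for_review([
    Extraction("indemnification_cap", "$5,000,000", 0.97),
    Extraction("basket", "0.5% of purchase price", 0.62),
])
```

The threshold itself is a policy decision: lowering it shifts work toward full review, which is the conservative direction when a new document type enters production.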
Regression Testing and Continuous Improvement
AI systems do not stay fixed. Models are updated. New document types are added. Extraction logic is refined. Without systematic regression testing, an improvement in one area can quietly degrade performance in another.
Our regression testing framework runs the full benchmark suite against every model update before it reaches production. The process is automated and produces a comparison report showing:
- Performance changes by document type and clause category
- Any regressions that exceed defined thresholds
- New test cases added to the corpus since the last release
- Overall trend lines for precision, recall, and F1 across the benchmark
No model update ships to production with a regression that exceeds our threshold in any category. This is a deliberate trade-off: we move slower on updates, but we maintain consistency for the law firms and private equity teams that depend on our system for live deals.
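A regression gate of this kind is conceptually a per-category diff against the baseline. The sketch below assumes metrics keyed by category name and a hypothetical 0.01 F1 threshold; neither is Mage's published configuration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    threshold: float = 0.01,
) -> dict[str, float]:
    """Return categories where the candidate's F1 dropped more than
    `threshold` below the baseline. Any non-empty result blocks release."""
    return {
        category: candidate.get(category, 0.0) - baseline[category]
        for category in baseline
        if baseline[category] - candidate.get(category, 0.0) > threshold
    }

# The lease model improved, but indemnity regressed past the threshold,
# so this candidate would not ship.
regressions = find_regressions(
    baseline={"lease": 0.92, "indemnity": 0.88},
    candidate={"lease": 0.93, "indemnity": 0.85},
)
```

Running this over every document type and clause category on every update is what makes the "no regression ships" policy enforceable rather than aspirational.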
What Accuracy Means in Practice
The goal of accuracy testing is not to produce an impressive number for a marketing slide. It is to give deal teams a reliable answer to a practical question: can I trust this system enough to change my workflow?
The honest answer is nuanced. For certain document types and extraction tasks, the system is reliable enough that attorneys can review by exception rather than reviewing every output. For other tasks, the system is a powerful accelerator that surfaces the right provisions for attorney review but requires human verification on every output.
Knowing the difference between those two categories, for every document type and clause category, is what rigorous benchmarking provides. It is also what separates tools that attorneys actually adopt from tools that get piloted and abandoned.
We built Mage's contract review and clause extraction capabilities with this philosophy: measure rigorously, report honestly, and let attorneys make informed decisions about how to use the tool in their workflow.
Frequently Asked Questions
How does Mage measure the accuracy of its legal AI?
Mage measures accuracy against expert human reviewers on curated test sets of real legal documents. The evaluation covers document classification accuracy, provision extraction completeness, and extraction correctness across every supported document type and clause category. Each test set is annotated by experienced M&A attorneys who establish the ground truth. The system's outputs are compared against these annotations using precision, recall, and F1 metrics at both the document and provision level.
What accuracy rate does Mage achieve on contract review?
Mage's accuracy varies by document type and extraction complexity, which is why we report accuracy per document category rather than a single aggregate number. Across our benchmark test sets, Mage consistently achieves precision and recall rates that exceed those of junior associate reviewers and approach those of mid-level associates on most document types. We publish category-level accuracy metrics and update them as our models improve.
How does AI accuracy compare to human reviewer accuracy?
Human reviewer accuracy varies significantly by experience level and fatigue. Studies of manual document review consistently show error rates of 15% to 30% among junior reviewers, with accuracy improving but never reaching 100% at the senior level. AI systems like Mage achieve consistent accuracy that does not degrade with volume or fatigue. The practical advantage is not perfection but consistency: the same document reviewed at the beginning or end of a data room receives the same level of attention.
What is confidence scoring in legal AI?
Confidence scoring assigns a reliability indicator to each extraction, reflecting the system's certainty about the accuracy of that specific output. High-confidence extractions can be reviewed quickly. Low-confidence extractions flag items where the system encountered ambiguity, unusual language, or contradictory provisions. This allows attorneys to allocate their review time efficiently, focusing human attention on the outputs most likely to need correction rather than reviewing every extraction uniformly.
Related Articles
LLM Hallucination in Contract Analysis: Why Source Verification Is Non-Negotiable
Large language models hallucinate. In legal contract analysis, a single fabricated clause citation can derail a deal. Here is how hallucination manifests in legal AI, why it happens, and how to build systems that prevent it.
Amendment Chain Resolution: The Hardest Problem in Legal AI
Why amendment chains break standard AI document analysis approaches, how structured extraction handles them, and what makes multi-amendment resolution the defining technical challenge for legal AI systems.
Why We Built a Legal Document Classifier First
Why Mage built document classification before extraction, how document types determine extraction strategy, and why getting classification right is the prerequisite for everything else in legal AI.