Understanding False Positive Rates in AI Contract Review
When evaluating legal AI tools, accuracy metrics matter. But not all accuracy is equal. Understanding the difference between precision and recall helps you evaluate whether a tool will save you time or create more work.
What Are False Positives?
In AI contract review, a false positive occurs when the system flags a contract provision that does not actually contain what was searched for. For example, if you ask the system to find change-of-control (COC) clauses and it returns a standard assignment provision that has no COC trigger, that is a false positive.
True Positive
The system correctly identifies a provision that matches what you searched for. This is the ideal outcome.
False Positive
The system flags a provision as a match when it is not. You waste time reviewing irrelevant results.
False Negative
The system misses a provision that should have been flagged. This is the most dangerous error type in legal review.
True Negative
The system correctly ignores a provision that does not match. This is the expected behavior for most contract text.
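The four outcomes above form the standard confusion matrix. As an illustration, here is a minimal sketch that maps a system decision and the attorney-reviewed ground truth to the right category (the function name and example are ours, not from any particular tool):

```python
def classify_outcome(flagged: bool, relevant: bool) -> str:
    """Map a (system decision, ground truth) pair to its confusion-matrix cell."""
    if flagged and relevant:
        return "true positive"   # correct match: the ideal outcome
    if flagged and not relevant:
        return "false positive"  # irrelevant result: wasted review time
    if not flagged and relevant:
        return "false negative"  # missed provision: the riskiest error
    return "true negative"       # correctly ignored: most contract text

# Example: the system flags an assignment clause with no COC trigger
print(classify_outcome(flagged=True, relevant=False))  # → false positive
```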
Why False Positives Matter in Legal AI
In contract review, false positives and false negatives have very different costs. A false negative (missing a critical provision) could expose the client to undisclosed risk. A false positive (flagging a provision that does not actually match) wastes attorney time but does not create legal risk. Most legal AI systems are calibrated to minimize false negatives, even at the cost of higher false positive rates.
The tradeoff
Setting the sensitivity too high catches everything but creates a flood of irrelevant results that attorneys must manually triage. Setting it too low reduces noise but risks missing provisions. The goal is to find the threshold where recall is high enough to be trustworthy while keeping precision high enough to be usable.
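The tradeoff can be made concrete with a toy threshold sweep. The sketch below uses made-up confidence scores and ground-truth labels (not real system output) to show how raising the cutoff improves precision at the expense of recall:

```python
# Toy scored results: (confidence score, actually relevant?). Numbers are illustrative.
results = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
           (0.60, False), (0.55, False), (0.40, True)]

def precision_recall(threshold):
    """Precision and recall if we only flag results at or above the threshold."""
    flagged = [relevant for score, relevant in results if score >= threshold]
    tp = sum(flagged)                                  # true positives among flags
    total_relevant = sum(rel for _, rel in results)    # everything we should find
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / total_relevant
    return precision, recall

for t in (0.5, 0.7, 0.9):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, a 0.5 cutoff yields precision 0.50 at recall 0.75, while a 0.9 cutoff yields precision 1.00 but recall only 0.50: the flood of noise disappears, but a relevant provision is missed.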
High false positive rate (low precision)
- Attorneys spend hours reviewing irrelevant results
- Team loses confidence in the tool and reverts to manual review
- The AI creates more work than it saves
Low false positive rate (high precision)
- Results are actionable without extensive filtering
- Attorneys trust the output and adopt the tool for future deals
- Time savings compound across every project
How Mage Minimizes False Positives
At Mage, we take a multi-stage approach to extraction that achieves high recall while keeping false positives low. Each stage acts as a filter, progressively refining results.
Broad initial extraction
The first pass uses high-recall models that cast a wide net. The goal at this stage is to never miss a relevant provision, even if it means capturing some irrelevant ones.
Validation layer
A second model reviews each extracted provision against the original query intent. Results that do not genuinely match are filtered out, reducing noise without losing true positives.
Confidence scoring
Each surviving result receives a confidence score. High-confidence results appear at the top of the matrix; low-confidence results are grouped separately for optional review.
Source linking
Every result links to the source text in the original document. Attorneys can verify any extraction with one click, making false positives easy to identify and dismiss.
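To make the filter-then-score idea concrete, here is a deliberately simplified toy pipeline. It is NOT Mage's actual implementation: stage 1 is loose keyword matching, stage 2 is a stricter phrase check standing in for the validation model, and stage 3 is a trivial confidence ordering.

```python
# Illustrative toy pipeline (not a real product's code).
provisions = [
    "Upon a change of control of the Company, Licensor may terminate.",
    "This Agreement may not be assigned without prior written consent.",
    "Control of the premises shall revert to the Landlord.",
]

def pipeline(provisions, broad_terms, validating_phrase):
    # Stage 1: broad, high-recall extraction — any loose term match survives
    candidates = [p for p in provisions
                  if any(term in p.lower() for term in broad_terms)]
    # Stage 2: validation — keep only provisions matching the stricter phrase
    validated = [p for p in candidates if validating_phrase in p.lower()]
    # Stage 3: toy confidence ordering — earlier phrase position ranks higher
    return sorted(validated, key=lambda p: p.lower().index(validating_phrase))

hits = pipeline(provisions, broad_terms=("control",),
                validating_phrase="change of control")
print(hits)  # only the genuine change-of-control clause survives
```

Note how stage 1 over-captures (the landlord clause mentions "control") and stage 2 filters that noise out without dropping the true positive, which is the whole point of the multi-stage design.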
Measuring Accuracy
When evaluating any AI contract review tool, ask for specific metrics rather than general accuracy claims. The metrics that matter for legal review are:
Precision
Of the provisions the system flagged, what percentage were actually relevant?
Why it matters: High precision means less noise. You spend less time dismissing irrelevant results.
Recall
Of all the relevant provisions in the data room, what percentage did the system find?
Why it matters: High recall means fewer missed provisions. This is the most important metric for legal risk.
F1 Score
The harmonic mean of precision and recall. Balances both metrics into a single number.
Why it matters: Useful for overall comparison, but can mask tradeoffs. Always ask for precision and recall separately.
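These three metrics follow directly from the confusion-matrix counts. A short sketch, using hypothetical numbers for a single review run:

```python
def precision(tp, fp):
    """Of the flagged provisions, what fraction were actually relevant?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all relevant provisions, what fraction did the system find?"""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical run: 90 correct flags, 10 irrelevant flags, 5 missed provisions
p = precision(tp=90, fp=10)   # 0.90
r = recall(tp=90, fn=5)       # ~0.947
print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

Note that two tools can report the same F1 with very different precision/recall splits, which is exactly why the section advises asking for the two numbers separately.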
What to ask vendors: Request precision and recall metrics broken down by provision type (change of control, assignment, indemnification, etc.), measured against attorney-reviewed ground truth data sets. Aggregate accuracy numbers can be misleading if they are dominated by easy-to-extract provisions.