Understanding False Positive Rates in AI Contract Review
When evaluating legal AI tools, accuracy metrics matter. But not all accuracy is equal. Understanding the difference between precision and recall helps you evaluate whether a tool will save you time or create more work.
What Are False Positives?
In AI contract review, a false positive occurs when the system flags a contract provision that does not actually contain what was searched for. For example, if you ask the system to find change-of-control (COC) clauses and it returns a standard assignment provision that has no COC trigger, that is a false positive.
True Positive
The system correctly identifies a provision that matches what you searched for. This is the ideal outcome.
False Positive
The system flags a provision as a match when it is not. You waste time reviewing irrelevant results.
False Negative
The system misses a provision that should have been flagged. This is the most dangerous error type in legal review.
True Negative
The system correctly ignores a provision that does not match. This is the expected behavior for most contract text.
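The four outcomes above form the standard confusion matrix. As an illustration, here is a minimal sketch that maps a system decision and the attorney-reviewed ground truth to the right category (the function name and example are ours, not from any particular tool):

```python
def classify_outcome(flagged: bool, relevant: bool) -> str:
    """Map a (system decision, ground truth) pair to its confusion-matrix cell."""
    if flagged and relevant:
        return "true positive"   # correct match: the ideal outcome
    if flagged and not relevant:
        return "false positive"  # irrelevant result: wasted review time
    if not flagged and relevant:
        return "false negative"  # missed provision: the riskiest error
    return "true negative"       # correctly ignored: most contract text

# Example: the system flags an assignment clause with no COC trigger
print(classify_outcome(flagged=True, relevant=False))  # → false positive
```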
Why False Positives Matter in Legal AI
In contract review, false positives and false negatives have very different costs. A false negative (missing a critical provision) could expose the client to undisclosed risk. A false positive (flagging a provision that does not actually match) wastes attorney time but does not create legal risk. Most legal AI systems are calibrated to minimize false negatives, even at the cost of higher false positive rates.
The tradeoff
Setting the sensitivity too high catches everything but creates a flood of irrelevant results that attorneys must manually triage. Setting it too low reduces noise but risks missing provisions. The goal is to find the threshold where recall is high enough to be trustworthy while keeping precision high enough to be usable.
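The tradeoff can be made concrete with a toy threshold sweep. The sketch below uses made-up confidence scores and ground-truth labels (not real system output) to show how raising the cutoff improves precision at the expense of recall:

```python
# Toy scored results: (confidence score, actually relevant?). Numbers are illustrative.
results = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
           (0.60, False), (0.55, False), (0.40, True)]

def precision_recall(threshold):
    """Precision and recall if we only flag results at or above the threshold."""
    flagged = [relevant for score, relevant in results if score >= threshold]
    tp = sum(flagged)                                  # true positives among flags
    total_relevant = sum(rel for _, rel in results)    # everything we should find
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / total_relevant
    return precision, recall

for t in (0.5, 0.7, 0.9):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, a 0.5 cutoff yields precision 0.50 at recall 0.75, while a 0.9 cutoff yields precision 1.00 but recall only 0.50: the flood of noise disappears, but a relevant provision is missed.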
High false positive rate (low precision)
- Attorneys spend hours reviewing irrelevant results
- Team loses confidence in the tool and reverts to manual review
- The AI creates more work than it saves
Low false positive rate (high precision)
- Results are actionable without extensive filtering
- Attorneys trust the output and adopt the tool for future deals
- Time savings compound across every project
How Mage Minimizes False Positives
At Mage, we take a multi-stage approach to extraction that achieves high recall while keeping false positives low. Each stage acts as a filter, progressively refining results.
Broad initial extraction
The first pass uses high-recall models that cast a wide net. The goal at this stage is to never miss a relevant provision, even if it means capturing some irrelevant ones.
Validation layer
A second model reviews each extracted provision against the original query intent. Results that do not genuinely match are filtered out, reducing noise without losing true positives.
Confidence scoring
Each surviving result receives a confidence score. High-confidence results appear at the top of the matrix; low-confidence results are grouped separately for optional review.
Source linking
Every result links to the source text in the original document. Attorneys can verify any extraction with one click, making false positives easy to identify and dismiss.
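To make the filter-then-score idea concrete, here is a deliberately simplified toy pipeline. It is NOT Mage's actual implementation: stage 1 is loose keyword matching, stage 2 is a stricter phrase check standing in for the validation model, and stage 3 is a trivial confidence ordering.

```python
# Illustrative toy pipeline (not a real product's code).
provisions = [
    "Upon a change of control of the Company, Licensor may terminate.",
    "This Agreement may not be assigned without prior written consent.",
    "Control of the premises shall revert to the Landlord.",
]

def pipeline(provisions, broad_terms, validating_phrase):
    # Stage 1: broad, high-recall extraction — any loose term match survives
    candidates = [p for p in provisions
                  if any(term in p.lower() for term in broad_terms)]
    # Stage 2: validation — keep only provisions matching the stricter phrase
    validated = [p for p in candidates if validating_phrase in p.lower()]
    # Stage 3: toy confidence ordering — earlier phrase position ranks higher
    return sorted(validated, key=lambda p: p.lower().index(validating_phrase))

hits = pipeline(provisions, broad_terms=("control",),
                validating_phrase="change of control")
print(hits)  # only the genuine change-of-control clause survives
```

Note how stage 1 over-captures (the landlord clause mentions "control") and stage 2 filters that noise out without dropping the true positive, which is the whole point of the multi-stage design.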
Measuring Accuracy
When evaluating any AI contract review tool, ask for specific metrics rather than general accuracy claims. The metrics that matter for legal review are:
Precision
Of the provisions the system flagged, what percentage were actually relevant?
Why it matters: High precision means less noise. You spend less time dismissing irrelevant results.
Recall
Of all the relevant provisions in the data room, what percentage did the system find?
Why it matters: High recall means fewer missed provisions. This is the most important metric for legal risk.
F1 Score
The harmonic mean of precision and recall. Balances both metrics into a single number.
Why it matters: Useful for overall comparison, but can mask tradeoffs. Always ask for precision and recall separately.
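These three metrics follow directly from the confusion-matrix counts. A short sketch, using hypothetical numbers for a single review run:

```python
def precision(tp, fp):
    """Of the flagged provisions, what fraction were actually relevant?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all relevant provisions, what fraction did the system find?"""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical run: 90 correct flags, 10 irrelevant flags, 5 missed provisions
p = precision(tp=90, fp=10)   # 0.90
r = recall(tp=90, fn=5)       # ~0.947
print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

Note that two tools can report the same F1 with very different precision/recall splits, which is exactly why the section advises asking for the two numbers separately.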
What to ask vendors: Request precision and recall metrics broken down by provision type (change of control, assignment, indemnification, etc.), measured against attorney-reviewed ground truth data sets. Aggregate accuracy numbers can be misleading if they are dominated by easy-to-extract provisions.