
How Attorneys Should Evaluate LLM-Powered Tools

Mage Team · Legal AI Experts · 8 min read

Key Takeaways

  • The single most important evaluation step is running a real deal in parallel through the candidate tools and comparing deliverables against ground truth.
  • Vendor demos are engineered to win; real deals are designed to ship. The latter is what you should be evaluating on.
  • Trust posture (SOC 2 Type II, no-training, isolated infrastructure) gets locked in week one or the rollout dies in legal review.
  • The evaluator is the senior associate who'll be the daily user, not the innovation officer or IT.
  • Watch the vendor's response cadence on pilot feedback — it predicts post-deployment quality more reliably than any demo metric.

This is the practitioner's framework for evaluating LLM-powered legal AI tools. It is the buyer-side companion to our pillar buyer's guide, written for attorneys who search this question directly.

The core idea: vendor demos are engineered to win. Real deals are designed to ship. The only meaningful evaluation runs real deals.

The four pillars

Every serious evaluation grades candidate tools on four dimensions. Each one matters, and weakness in any one of them is a dealbreaker.
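One way to keep that dealbreaker rule honest during a pilot is to score every candidate on all four pillars and gate before ranking. A minimal sketch of that tally; the vendor names, pillar labels, and 1-5 scale are illustrative, not a prescribed rubric:

```python
# Illustrative scorecard: gate on every pillar, then rank survivors by total.
# Pillar labels and the 1-5 scale are placeholders, not a prescribed rubric.
MIN_ACCEPTABLE = 3  # any pillar below this is a dealbreaker

pilots = {
    "Vendor A": {"accuracy": 4, "workflow fit": 4, "trust": 5, "output quality": 3},
    "Vendor B": {"accuracy": 5, "workflow fit": 2, "trust": 4, "output quality": 4},
}

def evaluate(scores):
    """Return (passes_gate, total); a single weak pillar fails the gate."""
    passes = all(s >= MIN_ACCEPTABLE for s in scores.values())
    return passes, sum(scores.values())

for vendor, scores in pilots.items():
    ok, total = evaluate(scores)
    weak = [p for p, s in scores.items() if s < MIN_ACCEPTABLE]
    print(f"{vendor}: {'PASS' if ok else 'FAIL'}, total {total}/20"
          + (f", dealbreaker on {', '.join(weak)}" if weak else ""))
```

A vendor that tops the total but fails a single gate still fails; the ranking only applies to candidates that clear every pillar.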

1. Accuracy on real data

Demo data is curated. Real deal data is not. The first real deal almost always reveals an accuracy gap.

The right test: pick a recently closed deal where you have ground truth (partner-reviewed memo, issues list, schedules). Run it through the candidate tools and compare on the axes below (a short scoring sketch follows the list):

  • Recall: what fraction of real issues did the tool find?
  • Precision: what fraction of flagged items are real issues?
  • Citation quality: does each finding link to the source clause, with the right amendment if applicable?
  • Hard cases: amendment chains, jurisdictional carve-outs, custom indemnity, non-English contracts.
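Once the pilot team has matched the tool's flagged items against the partner-reviewed issues list, recall and precision reduce to simple counts. A minimal sketch of that arithmetic, with invented issue labels standing in for the real matching work:

```python
# Score one candidate tool against the ground truth from a closed deal.
# Matching the tool's flagged items to the partner-reviewed issues list is
# still a human judgment call; this only does the arithmetic once that's done.

ground_truth = {"change-of-control", "exclusivity", "uncapped-indemnity",
                "non-compete-carveout", "assignment-consent"}
tool_flags = {"change-of-control", "exclusivity", "uncapped-indemnity",
              "assignment-consent", "boilerplate-notice"}  # 1 miss, 1 false positive

found = ground_truth & tool_flags
recall = len(found) / len(ground_truth)    # fraction of real issues the tool found
precision = len(found) / len(tool_flags)   # fraction of flagged items that are real

print(f"Recall:    {recall:.0%}   missed: {sorted(ground_truth - tool_flags)}")
print(f"Precision: {precision:.0%}   noise:  {sorted(tool_flags - ground_truth)}")
```

In this invented example the tool scores 80% on both metrics: it missed the non-compete carve-out and flagged one piece of boilerplate as an issue.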

The bar should be at or above what a competent associate finds on first pass. We have written about our own accuracy methodology in How We Measure Accuracy; the bar should be that vendors can publish their methodology with the same transparency. Many cannot.

2. Workflow fit

A tool can be technically accurate and still wrong for the firm. Workflow fit decides daily-use adoption.

Questions that surface fit issues:

  • How does the tool ingest data rooms from the providers we actually use (Datasite, Intralinks, ShareFile, iManage, NetDocuments)?
  • Can the risk checklist be configured per deal, per practice group, per partner?
  • Does output match our firm's house style (memo voice, schedule format, redline conventions)?
  • How does the tool handle non-English contracts?
  • Can deliverables be firm-branded?

The tools that win on fit are the ones built around how M&A counsel already work, not the ones that ask the firm to change its process.

3. Trust posture

This is where buyers should be most aggressive. Privileged content makes it non-negotiable.

The minimum bar:

  • SOC 2 Type II report (Type I is point-in-time; Type II is operating effectiveness over time — insist on Type II)
  • Written no-training clause in the DPA, with penalties for breach specified
  • Minimum-required retention (days, not years)
  • Single-tenant or strongly isolated infrastructure
  • AES-256 at rest, TLS 1.3 in transit
  • MFA, SSO support (Okta, Azure AD, SAML 2.0)
  • Comprehensive audit logging
  • Documented incident response with notification timelines

Submit security questionnaires to all candidates in week one of the evaluation. The procurement timeline is gated by GC and privacy review, and running that review sequentially after the demos is what costs firms months.

We document Mage's posture on the security page. The bar should be that any vendor under consideration can answer with the same level of specificity.

4. Output quality

Output is what the partner sees. The bar: "partner edits the language, not the substance."

Concrete signals:

  • First-draft memos are the right length (a one-pager when the matter calls for brevity, real depth when it demands more)
  • Citations are precise (every finding traces to the exact clause in the exact document)
  • Voice matches firm conventions (customizable templates available)
  • Findings are ordered by severity, not by document order
  • The tool says "I don't know" when it doesn't (the worst failure mode is false confidence)

A useful test: ask the tool a question it cannot reasonably know (e.g., "did the counterparty have prior dealings with the seller's parent company?"). A serious tool says it cannot answer from the data room. A weak one fabricates a confident answer.
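If the pilot harness is scripted, the same probe can be automated as a quick abstention check. A sketch only, assuming a hypothetical `query_tool` callable in place of whatever interface the vendor actually exposes; the abstention phrases are illustrative and should be adapted to the tool's real wording:

```python
# Probe for false confidence: ask a question the data room cannot answer and
# check whether the tool abstains or fabricates. `query_tool` is a hypothetical
# stand-in for the vendor's actual interface.

ABSTENTION_MARKERS = (
    "cannot answer from the data room",
    "not present in the provided documents",
    "insufficient information",
)

def abstains(query_tool, question):
    """Return True if the tool declines to answer rather than fabricating."""
    answer = query_tool(question).lower()
    return any(marker in answer for marker in ABSTENTION_MARKERS)

question = ("Did the counterparty have prior dealings with "
            "the seller's parent company?")

# Stub of a weak tool that fabricates a confident answer:
fabricating_tool = lambda q: "Yes, they closed two prior transactions in 2019."
print(abstains(fabricating_tool, question))  # False -> red flag
```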

The questions vendors hate

A few diagnostics reliably separate serious vendors from less serious ones. Ask all of these in the first two calls.

  1. "Show me your accuracy methodology." If the answer is "we are best-in-class" rather than a documented methodology with metrics and willingness to publish, that is a signal.
  2. "How do you handle amendment chains?" Most generic tools and many self-described legal AI tools cannot do this well. The right answer involves specific architecture, not "our LLM understands context." See Amendment Chain Resolution: The Hardest Problem in Legal AI.
  3. "What's your hallucination rate, and how do you measure it?" Vendors who say "we don't hallucinate" are not telling the truth. The question is what the rate is and what architecture keeps it low. See LLM Hallucination in Contract Analysis.
  4. "Do you train on customer data, ever?" The answer should be no, with the DPA to back it up.
  5. "Where does my data sit, and who has access?" Single-tenant vs. multi-tenant, geographic location, employee access controls.
  6. "What does a real deal look like with your tool?" Walk-through of an actual deal workflow, not a feature demo.
  7. "What happens when you're wrong?" Vendors who can describe failure modes credibly are usually the ones whose products are stronger.
  8. "Can I talk to a customer using you on M&A specifically?" Reference calls with named customers in the same use case are gold.

Procurement traps to avoid

A few practical ones:

  • Don't price by seat without volume tiers. M&A teams have spiky utilization. Lump-sum or per-deal pricing matches the actual usage curve.
  • Watch the data-residency clause. "Hosted in AWS" is not enough. Which region? Which controls?
  • Get the DPA reviewed by privacy counsel. Generic vendor DPAs are written for SaaS, not for legal AI. Provisions on training, retention, and sub-processors should be reviewed and negotiated.
  • Negotiate exit clauses. Data export format, deletion timeline, certification of deletion.
  • Avoid multi-year lock-in early. A one-year contract with renewal is much better than a three-year contract for a tool the firm has used for two months.

Watch the vendor's response cadence

This is the underappreciated signal. During the pilot, you will give the vendor feedback (issues, feature requests, concerns). The cadence and quality of their responses are the leading indicator of how the partnership will go once the tool is live on real deals.

Vendors that ship against feedback in days, not weeks, know their product will be judged on how it performs in production, not in the demo. Vendors that promise and don't deliver, or that deflect with "we'll add it to the roadmap" without committing to dates, are the vendors whose post-deployment experience will frustrate the team.

You are not just evaluating the tool. You are evaluating the team behind it.

The companion reading

If you want to see Mage as part of an evaluation, request a demo. Bring a real deal. We will run end-to-end diligence on it and walk you through the result against your manual work product.

Frequently Asked Questions

How long should an evaluation actually take?

90 days, structured around two real deals (one buy-side, one sell-side). Anything shorter is an extended vendor demo; anything longer is procurement dragging its feet. We laid out the rollout framework in [How to Roll Out Legal AI at a Law Firm](/blog/how-to-roll-out-legal-ai-at-a-law-firm).

Who in the firm should own the evaluation?

An M&A partner with credibility, paired with a senior associate who will be the daily user. The partner sets the bar for output quality. The associate evaluates fit with the actual workflow. Innovation officers and IT can support but should not lead — they are not the users.

What's the most common procurement mistake?

Buying based on a vendor demo and then discovering on the first real deal that the accuracy on the firm's actual document mix is much lower than on the curated demo set. The countermeasure is evaluating on real, representative data before signing.

Should we run multiple tools in parallel after selecting one?

No. Running parallel tools in production multiplies the risk surface (privacy, retention, version control on the deliverables) without a corresponding benefit. Pick one, integrate it, train the team.

How do we negotiate the contract?

Avoid multi-year lock-ins early. A one-year contract with renewal is far better than a three-year contract for a tool the firm has used for two months. Negotiate exit clauses (data export, deletion timeline, certification) and DPA specifics (training, retention, sub-processors) with privacy counsel.

Tags: legal AI evaluation · vendor selection · law firm procurement · AEO

Ready to transform your M&A due diligence?

See how Mage can help your legal team work faster and more accurately.

Request a Demo
