
Evaluating Legal AI Tools: A Buyer's Guide for M&A Counsel

Buyer's guide · 10 min read

Procurement of legal AI tools tends to fail in predictable ways. Firms either pick based on a vendor demo that is engineered to win, or they get stuck in a six-month evaluation cycle that ends with no decision. Neither is good for the M&A practice that needs the leverage.

This guide is a working framework for evaluating legal AI tools, focused on the M&A buyer specifically. It is the third installment of our pillar series; see Legal AI for M&A for the master hub and Legal AI vs. Harvey vs. Generic AI for the category landscape.

Frame the buy

Before any vendor calls, the M&A team should be clear on three things internally.

Scope. Is the tool meant to own the M&A workstream end-to-end, or is it complementary to a firm-wide assistant the firm has already deployed? The answer drives the shortlist. Specialists like Mage own the deal; firm-wide assistants like Harvey or Legora cover broader practices. Many firms run both. Decide which slot you are filling.

Volume. How many deals per year, what size, what mix of buy-side / sell-side / financing? A firm doing 30 mid-market deals annually has a different tool need than a firm doing 3 mega-deals annually. Volume drives the pricing model that will work and the level of customization that is worth paying for.

Quality bar. What does partner-grade output mean at this firm? "Memo we would send to a sophisticated client without rewriting" is one bar. "Issues list we would discuss internally before the partner drafts the memo" is another. Both are valid; the tool requirements are different.

The four evaluation pillars

Once the buy is framed, candidate tools get evaluated against four pillars.

Pillar 1: Accuracy on real, representative data

This is the pillar that matters most and is hardest to evaluate honestly. Vendor demos are curated. The first real deal almost always reveals an accuracy gap.

The right test: pick a recent deal you have closed where you have ground truth (the partner-reviewed memo, issues list, schedules). Run it through the candidate tools. Compare findings against ground truth on:

  • Recall (what fraction of real issues did the tool find?). The bar should be at or above what a competent associate finds on first pass.
  • Precision (what fraction of flagged items are real issues?). Below 70% precision and the partner reads everything anyway.
  • Quality of citation. Does each finding link to the source clause, with the right amendment? Or does the tool reference text that is no longer operative?
  • Behavior on hard cases: amendment chains, jurisdictional carve-outs, custom indemnity, non-English contracts.
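The recall and precision checks above are simple set arithmetic once findings are matched to ground truth. A minimal sketch, assuming the team has already mapped the tool's flagged items and the partner-reviewed issues to a shared set of issue identifiers (the matching scheme itself is the hard, manual part):

```python
# Hypothetical sketch: scoring a candidate tool's findings against the
# partner-reviewed ground truth from a closed deal. Issue IDs stand in for
# whatever matching scheme your team uses (clause refs, issue descriptions).

def score_findings(ground_truth: set[str], flagged: set[str]) -> dict[str, float]:
    """Return recall and precision for one evaluation run."""
    true_positives = ground_truth & flagged
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    precision = len(true_positives) / len(flagged) if flagged else 0.0
    return {"recall": recall, "precision": precision}

# Example: 10 real issues in the memo; the tool flags 12 items, 8 of them real.
truth = {f"issue-{i}" for i in range(10)}
tool = {f"issue-{i}" for i in range(8)} | {"fp-1", "fp-2", "fp-3", "fp-4"}
scores = score_findings(truth, tool)
print(scores)  # recall 0.8, precision ~0.67 -> below the 70% precision bar
```

Run the same scoring for each candidate tool on the same deal so the numbers are directly comparable.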

A serious vendor will let you run this evaluation with anonymized data or under a paid pilot. Vendors who push back are vendors who do not want to be evaluated honestly.

We have written about our own accuracy methodology in How We Measure Accuracy. The bar should be that vendors can publish their methodology as transparently. Many cannot.

Pillar 2: Workflow fit

A tool can be technically accurate and still wrong for the firm. Workflow fit is what determines whether the tool actually gets used day to day.

Questions that surface fit issues:

  • How does the tool ingest data rooms from the providers we actually use (Datasite, Intralinks, ShareFile, Box, iManage, NetDocuments, raw zips)?
  • Can the risk checklist be configured per deal, per practice group, per partner preference, or is it baked in?
  • Does output match our firm's house style (memo voice, schedule format, redline conventions), or do we end up rewriting?
  • How does the tool handle contracts in languages other than English?
  • Can it produce firm-branded deliverables our clients accept?
  • How does the tool integrate with the document management system we already use?

The tools that win on workflow fit are the ones built around how M&A counsel already work, not the ones that ask the firm to change its process to fit the tool.

Pillar 3: Trust posture (security, privacy, privilege)

This is where a buyer should be most aggressive. Privileged content makes this non-negotiable.

The minimum bar:

  • SOC 2 Type II report. Available on request. Type II (audited operating effectiveness over time), not Type I (point-in-time design).
  • No training on customer data. In writing, in the DPA. Not just a marketing claim. Penalties for breach should be specified.
  • Minimum-required retention. Documents purged when no longer needed for service delivery. The vendor's default retention should be days, not years.
  • Single-tenant or strongly isolated infrastructure. Multi-tenant SaaS with logical separation is acceptable for some firms; many require single-tenant for sensitive deals.
  • Encryption at rest (AES-256) and in transit (TLS 1.3). Same standards as financial institutions.
  • MFA and SSO support. Okta, Azure AD, Google Workspace, SAML 2.0.
  • Comprehensive audit logging. Who accessed what, when. Available for forensic review on request.
  • Incident response procedures. Documented, tested, with notification timelines that meet the firm's regulatory obligations.

We document Mage's specific posture on the security page. The bar should be that any vendor under consideration can answer with the same level of specificity. Vendors who deflect with "we are working on it" should be reconsidered or held to a longer evaluation timeline.

Pillar 4: Output quality

Output is the part the partner actually sees. The bar is "partner edits the language, not the substance."

Concrete quality signals:

  • The first-draft memo is the right length for partner review. Not five-page summaries when a one-pager is right; not one-pagers when the matter requires depth.
  • Citations are precise. Every finding traces to the exact clause in the exact document, with the right amendment if applicable.
  • The voice matches firm conventions. Firms with a particular house style should be able to customize templates.
  • The findings are ordered by severity, not by document order. Partners want the high-impact items at the top.
  • The tool says "I don't know" when it doesn't. False confidence is the worst possible failure mode.

A useful test: ask the tool a question it cannot reasonably know (e.g., "did this counterparty have prior dealings with the seller's parent company?"). A serious tool will say it cannot answer from the data room. A weak one will fabricate a confident-sounding answer.

The questions vendors hate

A few questions reliably separate serious vendors from less serious ones. Ask all of these in the first two calls.

  1. "Show me your accuracy methodology." If the answer is "we are best-in-class" rather than a documented methodology with metrics, recall against ground truth, and a willingness to publish, that is a signal.
  2. "How do you handle amendment chains?" Most generic tools and many self-described legal AI tools cannot do this well. The right answer involves specific architecture (structured extraction, sequential amendment processing) not "our LLM understands context." See Amendment Chain Resolution for what actually matters here.
  3. "What's your hallucination rate, and how do you measure it?" Vendors who say "we don't hallucinate" are not telling the truth; every LLM-based tool hallucinates at some rate. The question is what the rate is and what the architecture does to keep it low. We discuss the issue in LLM Hallucination in Contract Analysis.
  4. "Do you train on customer data, ever?" The answer should be no, with the DPA to back it up.
  5. "Where does my data sit, and who has access?" Single-tenant vs. multi-tenant, geographic location, employee access controls.
  6. "What does a real deal look like with your tool?" Walk-through of an actual deal workflow, not a feature demo. The flow either makes sense for an M&A team or it doesn't.
  7. "What happens when you're wrong?" Vendors who can describe their failure modes credibly are usually the ones whose products are stronger. Vendors who claim no failure modes are bluffing.
  8. "Can I talk to a customer using you on M&A specifically?" Reference calls with named customers in the same use case are gold.

How to structure the pilot

Once a tool clears the four pillars on paper, run a pilot. The structure that works:

Two real deals, in parallel with the manual workflow. One buy-side, one sell-side. Different industries if possible. Different complexity (one straightforward, one with multi-jurisdiction or amendment-chain challenges).

Four-week duration. Enough time to run the deals through; not so long that procurement gets in the way.

Clear success metrics, agreed in advance:

  • Time to partner-reviewable issues list (target: 50%+ reduction).
  • Time to deliverable memo and schedule (target: 50%+ reduction).
  • Recall against manual ground truth (target: ≥ associate baseline).
  • Precision (target: ≥70%, with team comfort that the false-positive rate is workable).
  • Output rewrite percentage (target: <30% of memo, <40% of schedule).
  • Subjective: would the team adopt this tool unprompted?

A tool that hits these metrics on two real deals is the right buy. A tool that misses on more than one is not.

Procurement gotchas

A few practical traps to avoid:

  • Don't price by seat without volume tiers. M&A teams have spiky utilization. Lump-sum pricing or per-deal pricing matches the actual usage curve better.
  • Watch the data-residency clause. "We're hosted in AWS" is not enough. Which region? Which controls?
  • Get the DPA reviewed by privacy counsel. Generic vendor DPAs are written for SaaS, not for legal AI. Specific provisions on training, retention, and sub-processors should be reviewed and negotiated.
  • Negotiate exit clauses. What happens to your data when you leave? Export format, deletion timeline, certification of deletion.
  • Avoid multi-year lock-in early. A one-year contract with renewal is much better than a three-year contract for a tool the firm has used for two months.

Companion reading

This is the buyer-framework guide. For the rest of the pillar series, start from the master hub, Legal AI for M&A.

When you are ready to run a real evaluation: request a demo. We will give you a structured pilot plan, the documents the four pillars require, and a real deal to evaluate against. The right answer should be obvious by the end.

Frequently Asked Questions

What's the single most important question to ask a vendor?

"Show me the accuracy on a contract type I care about, on a deal you haven't seen, with output I can compare against my associate's manual review." Vendors who can answer with specifics deserve consideration. Vendors who answer in general language about "state-of-the-art accuracy" should not.

What's the table-stakes security posture?

SOC 2 Type II report, no training on customer data (in writing in the DPA), retention only for the minimum time required to deliver results, isolated single-tenant infrastructure where possible, AES-256 encryption at rest, TLS 1.3 in transit, MFA, SSO, audit logging. Anything weaker is unacceptable for privileged work.

How long should a pilot last?

Two real deals, ideally one buy-side and one sell-side, ideally with different complexity profiles. Anything shorter is an extended vendor demo; anything longer is procurement dragging its feet. The decision is usually obvious within deal one and confirmed by deal two.

Should we run multiple tools in parallel?

During evaluation, yes; head-to-head comparison on the same deals is the whole point. After selection, no: running parallel tools in production multiplies the risk surface (privacy, retention, version control on deliverables) without a corresponding benefit. Pick one, integrate it, train the team.

What's the most common procurement mistake?

Buying based on the vendor demo, then discovering on the first real deal that the accuracy is much lower on the firm's actual document mix than on the demo's hand-curated examples. The countermeasure is to evaluate on real, representative data before signing.

Who in the firm should own the evaluation?

An M&A partner with credibility, paired with a senior associate who will be the daily user. The partner sets the bar for output quality. The associate evaluates fit with the actual workflow. Innovation officers and IT can support but should not lead — they are not the users.

See Mage on a real data room

Bring a current deal. We'll run buy-side or sell-side diligence end-to-end and walk you through the result.

Request a demo