
BigLaw Bench Workflows: SPA Deal Points

Expanding BigLaw Bench to evaluate legal workflow agents, starting with extracting deal points from Share Purchase Agreements (SPAs).

Sep 26, 2024


Harvey Team

We are expanding our publicly released BigLaw Bench data to include the data we use to benchmark, evaluate, and improve legal workflow agents. These workflow datasets each contain a large number of samples of a complex, recurring task and are used to evaluate and improve agent systems on these hard reasoning problems. The first dataset we are releasing covers extracting deal points from Share Purchase Agreements (SPAs). Although these documents fit within the context windows of foundation models, those models struggle to reason effectively over these complicated agreements, correctly identifying only 66.04% (GPT-4o) to 72.27% (Gemini) of a basic set of deal points. In contrast, Harvey’s SPA agents extract 98.47% of deal points correctly across diverse SPA documents. Here, we discuss our results and provide a subset of both the SPA data and the deal points schema.

Dataset

We collected a large number of SPAs from the SEC’s website and other sources. Our legal research team reviewed these SPAs and identified a subset of documents representative of the diversity and complexity of the agreements they worked with during their time at BigLaw firms. A sample of this dataset is available here. Each SPA was annotated according to a schema of deal points (example JSON) defined in collaboration with our legal research team and clients. We continue to actively solicit feedback to improve our primary schema and to define custom schemas for specialized private use cases across document types.
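To make the schema concrete, here is a minimal sketch of what a few deal-point entries might look like; the field names, types, and descriptions are illustrative assumptions, not the released schema.

```python
# Hypothetical shape of a deal-point schema; entries are illustrative only.
deal_point_schema = [
    {
        "name": "purchase_price",
        "type": "number",
        "description": "Total consideration payable under the SPA.",
    },
    {
        "name": "indemnification_cap_purchaser_fundamental_reps",
        "type": "string",
        "description": "Cap on indemnification for Purchaser Fundamental Representations.",
    },
    {
        "name": "governing_law",
        "type": "string",
        "description": "Jurisdiction whose law governs the agreement.",
    },
]
```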

Workflow

Building an agentic workflow on this dataset required understanding the reasoning challenges standard models face when interpreting and extracting deal points. By categorizing these issues and asking our research team to trace the correct way to reason about these deal points, we identified common patterns an agentic system would need to improve on in order to meet our requirements. For example, all foundation models struggle to identify the indemnification cap with respect to Purchaser Fundamental Representations in the HealthEquity SPA. Correctly identifying this cap requires building systems that can reason effectively across multiple layers of cross-references, as this cap requires a model to consider, at least:

  1. Section 5.2(d): defining the cap as “limited to an amount equal to the Purchase Price actually due to the Sellers”;
  2. The definition of “Purchase Price”: as an amount equal to the Final Consideration, plus the Escrow Amount (if any released to the Sellers);
  3. The definition of “Final Consideration”: as “(i) the Base Consideration, plus [a number of other factors]”;
  4. The definition of “Base Consideration”: which “has the meaning set forth in Section 2.4”;
  5. And finally, Section 2.4: which sets the Base Consideration at $50,000,000.

The model must then reason back through these layers to correctly contextualize the value of this deal point. These traces, along with other qualitative analyses, were used to build a system, composed of multiple LLMs and traditional ML techniques, that effectively reasons about and extracts the relevant deal points from the target documents. Because the entire system is specialized for a single task, we are able to achieve better performance than is possible in the base Harvey platform and significantly better performance than general-purpose foundation models can offer.
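As a rough illustration of the kind of multi-hop resolution this requires (a minimal sketch, not Harvey’s implementation, with the adjustments to Final Consideration omitted), the snippet below chains defined terms until it reaches a concrete value, keeping a trace so the result can be contextualized on the way back:

```python
# Minimal sketch of multi-hop definition resolution over an SPA's defined terms.
# The lookup table is hard-coded for illustration; a real system would retrieve
# and parse these sections, and Final Consideration's adjustments are omitted.
DEFINITIONS = {
    "Purchase Price": {"refers_to": "Final Consideration"},      # plus Escrow Amount, if released
    "Final Consideration": {"refers_to": "Base Consideration"},  # "(i) the Base Consideration, plus ..."
    "Base Consideration": {"refers_to": "Section 2.4"},          # "has the meaning set forth in Section 2.4"
    "Section 2.4": {"value": 50_000_000},                        # sets the Base Consideration at $50,000,000
}

def resolve(term: str, trace: list[str] | None = None) -> tuple[int, list[str]]:
    """Follow cross-references until a concrete value is found, keeping the trace."""
    trace = (trace or []) + [term]
    entry = DEFINITIONS[term]
    if "value" in entry:
        return entry["value"], trace
    return resolve(entry["refers_to"], trace)

value, trace = resolve("Purchase Price")
print(value)  # 50000000
print(trace)  # ['Purchase Price', 'Final Consideration', 'Base Consideration', 'Section 2.4']
```

The trace is what lets a system reason back through the layers rather than stop at the first definition it encounters.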

In addition to accuracy, our system is designed to deliver on two other key components of an effective LLM answer: transparent reasoning and sources. Foundation models may return partially correct answers, but they do so without elaboration. Asking for details often confounds models or locks them into poor reasoning patterns, making scores worse. In contrast, Harvey’s agents deliver exceptional accuracy and build confidence in these results by providing both a reason for each data point returned and the source(s) that back up its conclusion.
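As an illustration of that output shape (a hypothetical format, not Harvey’s actual response schema), each extracted deal point might carry its value, its reasoning, and its supporting sources:

```python
# Hypothetical shape of an extracted deal point with its supporting reasoning
# and sources; field names and the example values are illustrative only.
from dataclasses import dataclass

@dataclass
class ExtractedDealPoint:
    field: str          # schema field name
    value: str          # normalized answer
    reason: str         # short explanation of how the value was derived
    sources: list[str]  # sections or passages that support the conclusion

example = ExtractedDealPoint(
    field="indemnification_cap_purchaser_fundamental_reps",
    value="Limited to the Purchase Price actually due to the Sellers "
          "(Base Consideration of $50,000,000, plus adjustments)",
    reason="Section 5.2(d) caps indemnification at the Purchase Price, which "
           "resolves through Final Consideration and Base Consideration to Section 2.4.",
    sources=["Section 5.2(d)", "Definition of 'Purchase Price'", "Section 2.4"],
)
```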

Evaluation

We evaluated our agentic system, as well as existing foundation models, on the BigLaw Bench SPA set. This set consists of SPAs that were not used in building our agentic system, ensuring the process generalizes and avoiding overfitting. For evaluation, we compute accuracy against the ground truth deal points as extracted and verified by BigLaw attorneys. For fields that are not string-typed (e.g., dates, numbers, booleans), we perform exact matches after the fields are normalized. For text-based fields, we use a model to compare matches, and we verified the model grader’s efficacy with BigLaw attorneys. Overall accuracy is computed as a straight average over all individual fields across the test set.
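A minimal sketch of that scoring rule follows; normalize() and llm_grade() here are simplified stand-ins for the actual normalization and model-grader steps, which are not public.

```python
# Sketch of the scoring rule described above. normalize() and llm_grade() are
# simplified stand-ins (assumptions), not the actual implementations.

def normalize(value, field_type: str):
    """Canonicalize non-string fields so exact comparison is meaningful."""
    if field_type == "number":
        return float(str(value).replace("$", "").replace(",", ""))
    if field_type == "boolean":
        return str(value).strip().lower() in {"true", "yes", "1"}
    return str(value).strip().lower()  # dates assumed pre-formatted as ISO strings

def llm_grade(predicted: str, expected: str) -> bool:
    """Placeholder for the model grader used on free-text fields."""
    return predicted.strip().lower() == expected.strip().lower()

def score_field(predicted, expected, field_type: str) -> float:
    """1.0 if the prediction matches the ground truth for this field, else 0.0."""
    if field_type in {"date", "number", "boolean"}:
        return float(normalize(predicted, field_type) == normalize(expected, field_type))
    return float(llm_grade(predicted, expected))

def overall_accuracy(rows) -> float:
    """Straight average over every (predicted, expected, field_type) row in the test set."""
    scores = [score_field(p, e, t) for p, e, t in rows]
    return sum(scores) / len(scores)
```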

Next steps

We plan to continue expanding BigLaw Bench to include more workflows and more complex tasks. By continuing to make BigLaw Bench methodologies and data public, we will give the industry ideas for more rigorous and realistic benchmarks and encourage other providers to do the same. We think this is the first step toward more standardized benchmarking for the industry as a whole.

We also hope these workflows serve as inspiration for our present and future clients. We’ve had a lot of requests for workflows and will be incorporating them into the Assistant and Vault products over the next few months.
