BigLaw Bench – Retrieval
Harvey’s retrieval system outperforms commonly used embedding-based and reranking methods, identifying up to 30% more relevant content than alternative approaches across a diverse range of legal document types.
Nov 13, 2024
Niko Grupen
Julio Pereyra
Retrieval-augmented generation (RAG) is a critical element of practical Large Language Model (LLM) systems. Despite its importance, RAG is often trivialized and treated as simple semantic search, where relevant content is retrieved by comparing the similarity between an embedded query and embedded passages of text. Building high-quality retrieval systems, however, requires supplementing embedding-based semantic search with domain-specific (often also LLM-based) data preprocessing, metadata extraction, embedding fine-tuning, and re-ranking or filtering techniques.
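As a rough illustration of this distinction, a minimal two-stage pipeline might look like the sketch below. The `embed` and `rerank` helpers are placeholders for whatever embedding model and reranker a system uses; this is an assumption-laden sketch, not a description of Harvey’s implementation.

```python
# Minimal sketch of an embedding + reranking retrieval pipeline.
# embed() and rerank() are illustrative placeholders, not Harvey's production system.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: call an embedding model and return one vector per text."""
    raise NotImplementedError

def rerank(query: str, passages: list[str]) -> list[float]:
    """Placeholder: score each passage against the query with a stronger model (cross-encoder or LLM)."""
    raise NotImplementedError

def retrieve(query: str, passages: list[str], k_embed: int = 50, k_final: int = 10) -> list[str]:
    # Stage 1 - embedding recall: cosine similarity between the query vector and each passage vector.
    vectors = embed(passages)
    q = embed([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    candidates = [passages[i] for i in np.argsort(-sims)[:k_embed]]

    # Stage 2 - reranking: rescore the smaller candidate set and keep the top results.
    scores = rerank(query, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:k_final]]
```

In practice, the preprocessing, metadata extraction, and filtering steps mentioned above sit around and inside both stages; the two-stage skeleton is only the starting point.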
Thus far, our evaluation work through BigLaw Bench has focused on measures of answer quality and its associated factors (e.g. hallucinations, citations) – emphasizing the "generation" step in RAG. But finding relevant content from a dataset to answer a given query – the retrieval step – is an equally important component of these systems. To that end, we are expanding BigLaw Bench to include retrieval-focused benchmarks over legal documents. On these tasks, Harvey’s retrieval system outperforms commonly used embedding-based and reranking methods, identifying up to 30% more relevant content than alternative retrieval methods from OpenAI and other embedding model providers (Voyage, Cohere) across a diverse range of legal document types. These leading techniques are used throughout Harvey’s products across Assistant, Knowledge Sources, Vault, and Workflows when analyzing everything from complex documents to large databases.
Legal Retrieval Dataset
Retrieval in legal domains presents a number of unique challenges not contemplated by typical RAG solutions. Indeed, different documents within the legal domain have diverse properties that affect optimal ways to retrieve key information, for example:
- Contracts: Complex documents (e.g., hundreds of pages and potentially hundreds of thousands of tokens of text) with cross-references and defined terms that must be tracked to effectively contextualize relevant text.
- Discovery Emails: Relatively short documents that arrive in high volume and have complex relationships (e.g., email threads) and rich metadata (sender, recipient, attachments) that are essential to identifying relevant messages.
- Research Databases: Datasets like case law consisting of millions of documents, where each document can itself be hundreds of pages and where properties like recency cannot be captured by vanilla semantic search.
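To make this diversity concrete, the hypothetical records below sketch how differently these three document types might need to be represented for retrieval. The field names are invented for illustration and are not Harvey’s actual schema.

```python
# Hypothetical record types for contracts, discovery emails, and research documents.
# Field names are illustrative only.
from dataclasses import dataclass
from datetime import date

@dataclass
class ContractChunk:
    text: str
    section: str                   # e.g. the clause or article this passage sits in
    defined_terms: dict[str, str]  # term -> definition, resolved from elsewhere in the document
    cross_references: list[str]    # other sections this passage refers to

@dataclass
class DiscoveryEmail:
    text: str
    sender: str
    recipients: list[str]
    thread_id: str                 # ties the message back to its email thread
    attachments: list[str]

@dataclass
class ResearchDocument:
    text: str
    jurisdiction: str
    decision_date: date            # enables recency-aware ranking
    citations: list[str]
```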
Optimizing information retrieval against these documents requires not only a deep understanding of their structure but also an understanding of how and why lawyers want to search against them. For example, in case law, lawyers care that precedents are not only relevant but also recent and controlling in a jurisdiction. By contrast, when finding precedents in Merger Agreements, factors like industry and deal size can make information more or less relevant. Understanding and capturing this nuance in a retrieval dataset is essential to improving RAG performance in a way that matters to our clients.
To do so, we collected comprehensive examples of two types of datasets. The first is research datasets: collections of documents from research databases like case law, legislation, EUR-Lex, and published law firm memoranda. The second is document collections: large-scale aggregations of a particular document type that are typically queried together in practice, such as contracts (queried to find precedents or specific deal terms) and emails (queried as part of discovery or investigatory work). For each dataset, we used a mix of human experts and AI systems to generate a large number of salient queries that would be asked against the relevant dataset or document type. Then, for each question, our AI-assisted experts annotated the relevant material that would need to be retrieved in order to respond correctly. Examples of queries used for each dataset can be found here.
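One way to picture a single benchmark example is a query paired with the set of expert-annotated passages a system must surface to answer it. The exact annotation format is not published, so the schema below is only an assumption.

```python
# Hypothetical shape of one BigLaw Bench Retrieval example; illustrative only.
from dataclasses import dataclass

@dataclass
class RetrievalExample:
    dataset: str                    # e.g. a research database or a contract collection
    query: str                      # a salient question asked against that dataset
    relevant_passage_ids: set[str]  # passages annotated by experts as required to answer the query
```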
Evaluation
We evaluated our proprietary Harvey retrieval system, as well as existing embedding and reranking models, on the BigLaw Bench Retrieval dataset. We measured performance as the percentage of relevant content found by the retrieval system for a given user query, graded against a ground truth established in advance by our in-house legal research team. In technical terms, this is equivalent to recall at a fixed token threshold. Below we show retrieval results across both contracts and research datasets. Our tailored systems demonstrate superior performance, consistently identifying a greater proportion of relevant material compared to traditional embedding- or reranker-based methods.
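Concretely, recall at a fixed token threshold can be computed roughly as in the sketch below; the token budget and the tokenizer are assumptions, since the post does not specify them.

```python
# Sketch of "recall at a fixed token threshold": the fraction of ground-truth
# relevant passages that appear within the first N retrieved tokens.
# The budget and tokenizer are assumptions, not the benchmark's published settings.

def count_tokens(text: str) -> int:
    """Placeholder tokenizer; in practice use the tokenizer of the downstream LLM."""
    return len(text.split())

def recall_at_token_budget(retrieved: list[tuple[str, str]],   # (passage_id, passage_text), ranked
                           relevant_ids: set[str],             # expert-annotated ground truth
                           token_budget: int = 8000) -> float:
    kept: set[str] = set()
    used = 0
    for passage_id, text in retrieved:
        used += count_tokens(text)
        if used > token_budget:
            break
        kept.add(passage_id)
    if not relevant_ids:
        return 0.0
    return len(kept & relevant_ids) / len(relevant_ids)
```

Measuring recall against a fixed token budget, rather than a fixed number of passages, reflects the practical constraint that only a limited amount of retrieved text can be passed to the model at generation time.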
The evaluation results also show the benefit of task-specific optimization. Conventional RAG systems’ performance on Merger Agreements (“MAs”) and Stock Purchase Agreements (“SPAs”) varies dramatically, although lawyers would typically consider these complex contracts relatively similar, at least compared to documents like court opinions. These gaps confirm the value of domain- and task-specific retrieval in maximizing the efficacy of AI systems on any particular task.
Reviewing the data reveals a number of common patterns that explain why Harvey’s systems achieve best-in-class performance. These include:
- Metadata: Attaching metadata to passages of text, allowing retrieval systems to contextualize those passages within a complicated document.
- Features: Capturing features like recency that are not accounted for in typical semantic search systems.
- LLM-based retrieval: Using language models to reason about hard relevancy judgments and identify semantic patterns not captured by coarser methods like embeddings.
Combined, these patterns differentiate Harvey’s retrieval across complex legal datasets. They also provide avenues for deeper research to continue pushing the quality of retrieval for legal tasks.
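As a toy sketch of how such signals might be combined, consider the scoring function below. The weights, prompt-based relevance helper, and recency decay are invented for illustration and do not describe Harvey’s production ranking.

```python
# Toy blend of the three patterns above: metadata-contextualized passages,
# explicit features such as recency, and an LLM relevance judgment.
# Weights and helpers are illustrative assumptions.
from datetime import date

def embedding_similarity(query: str, passage: str) -> float:
    """Placeholder: cosine similarity between query and passage embeddings, scaled to 0-1."""
    raise NotImplementedError

def llm_relevance(query: str, passage: str) -> float:
    """Placeholder: ask an LLM to grade the passage's relevance to the query, 0-1."""
    raise NotImplementedError

def recency_score(decision_date: date, today: date) -> float:
    """Simple decay: newer documents score closer to 1."""
    age_years = (today - decision_date).days / 365.25
    return 1.0 / (1.0 + age_years)

def score_passage(query: str, passage: str, metadata_header: str,
                  decision_date: date, today: date) -> float:
    # Prepend metadata (e.g. court, jurisdiction, section heading) so the passage is judged in context.
    contextualized = f"{metadata_header}\n\n{passage}"
    return (0.5 * embedding_similarity(query, contextualized)
            + 0.2 * recency_score(decision_date, today)
            + 0.3 * llm_relevance(query, contextualized))
```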
Next Steps
Even with advances in model training, RAG remains an essential tool for ensuring that models have factual, up-to-date information at their disposal when answering questions grounded in source-of-truth materials. Our clients operate globally and rely on information from innumerable sources to provide advice and services. Ensuring high-quality retrieval across these varied sources is our primary goal. To this end, we intend to continue extending our retrieval benchmarks to cover all of the datasets and document types that lawyers routinely engage with.
As model systems become more complex, retrieval efficacy becomes even more important for allowing them to contextualize problems, identify relevant information, and execute complex tasks. An AI agent drafting a brief may need to search a docket to understand the procedural posture of a case; parse a discovery corpus to understand its facts; tap into a firm’s DMS to identify one (or more) relevant prior briefs to build from; and then search a case law and legislative database to find all the precedents needed to draft a winning argument. Our RAG benchmarks provide a framework for ensuring that these AI systems, and others, can enable lawyers to surface and leverage critical information more efficiently and effectively in all aspects of their practice.
Contributors: Julio Pereyra, Niko Grupen, Nan Wu, Boling Yang, Joel Niklaus, Matthew Guillod, Laura Toulme, Lauren Oh