Introducing BigLaw Bench
Presenting BigLaw Bench—a version of our internal dataset for evaluating large language models (LLMs) and model systems on complex legal tasks.
Aug 29, 2024
Harvey Team
Overview
Today we are announcing the first public results of BigLaw Bench—a public-facing version of our internal dataset for evaluating large language models (LLMs) and model systems on complex legal tasks. BigLaw Bench is a framework for quantitatively evaluating the performance of LLMs on real-world legal tasks, supplementing prior work that measures LLM legal reasoning in more structured settings. Harvey’s proprietary models significantly outperform publicly available LLMs, but all models show substantial room for improvement when benchmarked against full completion of tasks performed by lawyers. The results are promising: they show that legal AI systems have the potential to significantly improve the efficiency of lawyers by completing real-world tasks.
BigLaw Bench Tasks
“Existing multiple-choice or one-size-fits-all benchmarks are insufficient to capture the real billable work that lawyers do.”
At Harvey, our goal is always to build and evaluate LLMs for actual, billable work performed by lawyers and professionals reflective of our client base. Our benchmarks are conceptualized and designed by our legal research team, composed of attorneys with experience across a wide range of practice areas and BigLaw firms, and constructed using publicly available documents to enable transparent evaluation under realistic conditions.
In defining benchmark tasks, we draw on a familiar feature of the legal profession: the time entry. Time entries cover all of the tasks that lawyers perform in service of their clients and provide a concise way to convey both the task performed and the value of that work. Because all work is billed, a comprehensive set of time entries closely reflects the set of legal tasks that make up real-world practice.
The challenge with evaluating models on these tasks is that most of the work that lawyers do—assessing risk, drafting documents, constructing arguments, and advising clients on novel legal developments—is too complicated to be graded by the multiple-choice or other one-size-fits-all criteria that are the hallmarks of existing benchmarks. Effectively capturing how much models can facilitate this kind of work required developing far more sophisticated and task-specific grading criteria than those used in prior benchmarks.
By converting time entries into model-based tasks (prompt / document pairs), we are able to identify and evaluate how models can contribute to real, high-value legal work. These tasks are then sub-divided by other relevant taxonomies, such as the practice area—litigation or transactional—and the portion of a matter the task would facilitate. The following tables outline the major categories and distribution of BigLaw Bench core tasks, i.e. tasks that we use to test fundamental model capabilities to solve problems involving reasoning about, analyzing, and discussing legal concepts and text without the need for multi-part or agentic workflows.
Evaluation Methodology
Harvey’s research team developed bespoke rubrics to evaluate each task. These rubrics establish the objective criteria a model response must satisfy to accomplish a given task. They also penalize common LLM failure modes such as incorrect tone or length, irrelevant material, toxicity, and hallucinations. Together, these rubrics capture everything a model must do to complete a task, and everything it must avoid, so that the task is completed in a safe, effective, and trustworthy manner.
To convert these criteria into a benchmark, each affirmative requirement was assigned a positive score based on its importance to completing the relevant task. Negative criteria, such as hallucinations, were assigned negative scores, since the human effort required to correct errors reduces the utility of an otherwise complete piece of AI work product. The answer score is computed by summing the positive points earned and the negative points incurred, then dividing by the total positive points available for the task.
“Answer score represents: What % of a lawyer-quality work product does the model complete for the user?”
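As a rough illustration of this arithmetic, the sketch below shows one way a rubric-based answer score could be tallied. The rubric schema, criteria, and point values are hypothetical assumptions for illustration and are not drawn from Harvey’s actual rubrics.

```python
# Minimal sketch of an answer-score computation (illustrative only).

def answer_score(rubric, response_hits):
    """Score a model response against a task rubric.

    rubric: list of (criterion, points) pairs; affirmative criteria carry
            positive points, failure modes (e.g. hallucinations) carry
            negative points.
    response_hits: set of criteria the response satisfied (positive
                   criteria) or triggered (negative criteria).
    """
    total_positive = sum(pts for _, pts in rubric if pts > 0)
    earned = sum(pts for crit, pts in rubric if crit in response_hits)
    return earned / total_positive

# Hypothetical rubric for a contract-review task.
rubric = [
    ("identifies $10,000,000 termination fee", 3),
    ("flags change-of-control trigger", 2),
    ("uses appropriate advisory tone", 1),
    ("hallucinated a clause not in the document", -3),
]

# The response satisfied the first and third criteria and hallucinated once.
hits = {
    "identifies $10,000,000 termination fee",
    "uses appropriate advisory tone",
    "hallucinated a clause not in the document",
}
print(answer_score(rubric, hits))  # (3 + 1 - 3) / 6 ≈ 0.17
```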
In addition to grading the substantive content produced by the model, we also benchmark the verifiability of a model’s answer through a source score. A task’s source score is derived by identifying the substantive points the rubric requires a model to make for which a reference to a source document is needed to verify the point. For example, if a substantive point in the rubric was, “The model’s answer must state that the termination fee is $10,000,000,” the corresponding sourcing point would be, “The model must provide a source for its statement that the termination fee is $10,000,000.” A source was defined as any statement or link affirmatively connecting a sentence needing a source to a specific document, or part of a document, that proves that point.
“Source score represents: What % of correct statements does the model support with an accurate source?”
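To make the definition concrete, the sketch below shows one way such a source score could be tallied: the fraction of correct statements that are backed by an accurate citation. The field names and example points are illustrative assumptions, not Harvey’s implementation.

```python
# Minimal sketch of a source-score computation (illustrative only).

def source_score(sourcing_points):
    """sourcing_points: one dict per rubric point that requires a citation,
    with 'made' (the model stated the point correctly) and 'sourced'
    (it linked the statement to a specific document or passage)."""
    correct = [p for p in sourcing_points if p["made"]]
    if not correct:
        return 0.0
    sourced = [p for p in correct if p["sourced"]]
    return len(sourced) / len(correct)

points = [
    {"point": "termination fee is $10,000,000", "made": True, "sourced": True},
    {"point": "notice period is 30 days", "made": True, "sourced": False},
    {"point": "governing law is Delaware", "made": False, "sourced": False},
]
print(source_score(points))  # 1 of 2 correct statements is sourced -> 0.5
```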
Though related, source score and answer score are independent. A model can make a number of correct assertions (high answer score) while failing to provide traceability of those assertions to relevant source documents to facilitate user trust and validation (low source score).
A sample of instantiated tasks and their associated rubrics is included in the Appendix.
Results
“Harvey's proprietary models outperform leading foundation models on domain-specific tasks, producing 74% of a final, expert lawyer-quality work product: the outputs are more detailed, capture more nuance, and are much closer to final lawyer quality.”
The graphs below summarize model performance on BigLaw Bench core tasks. On answer scores, Harvey’s proprietary assistant models outperform each of the leading foundation models, producing outputs that are more detailed and materially closer to a final legal work product. In general, we found that public foundation models provided, on average, reasonably strong answers—solving the blank-page problem and getting users more than halfway to a final work product—but often lacked specificity or missed key legal nuances. The advantage of Harvey models holds when tasks are split into litigation and transactional categories. Overall stronger model performance on transactional tasks is driven by those tasks being, on average, more analytical, while litigation tasks tend to require the models to engage more in ideation and argumentation—areas where foundation models tended to underperform.
On source scores, performance differences are far more pronounced. Besides Harvey, only ChatGPT provided sources to documents for a meaningful number of tasks when sources were not explicitly requested in the prompt. To ameliorate this, the Harvey team attempted to craft custom prompts and instructions for all models that would require sources to be included after affirmative statements about the document. These prompts had a substantial negative impact on model performance: the foundation models would consistently hallucinate sources (document text, page number, or both), leading to weak source scores and far worse answer scores. In short, the foundation models provide good answers but have trouble showing their work, even when explicitly asked.
Next Steps
BigLaw Bench provides a framework for effectively benchmarking model performance on legal tasks regardless of complexity. However, it currently emphasizes tasks that today’s models can or should be able to do. This list of tasks falls far short of the goal of providing a benchmark of all tasks that lawyers must perform to deliver value for their clients. Many of these tasks remain far beyond the reach of LLMs and even the most sophisticated LLM agents. Mapping and benchmarking this full range of tasks will be necessary for the effective development of domain-specialized AI. As we continue to build this more ambitious benchmark, we intend to deepen and formalize our collaboration with both client and industry partners like vals.ai. These collaborations are essential to developing an industry standard benchmark for measuring and improving the ability of AI systems to perform the most complex knowledge work.
Appendix
Here we provide an example task from both the litigation and transactional categories for reference. A more comprehensive description of exemplar tasks sampled from BigLaw Bench, including associated rubrics, can be found here.
Credits: Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra