
BigLaw Bench — Hallucinations

On legal tasks, Harvey's models hallucinate at a lower rate than the foundation models despite providing longer and more detailed answers.

Oct 7, 2024

Harvey Team

Hallucinations are one of the most common concerns with Large Language Models (LLMs). LLMs’ ability to invent spurious but credible-sounding content creates trust issues, even when hallucinated content is rare compared to valid analysis and insights. In this post, we discuss how we define hallucinations, how we detect them, and how our systems for reducing hallucinations perform on BigLaw Bench. On BigLaw Bench tasks, Harvey’s Assistant model hallucinates in roughly 1 in 500 claims (0.2%). Foundation models hallucinate more frequently, with rates ranging from roughly 1 in 150 (0.7%, Claude) to roughly 1 in 50 (1.9%, Gemini). Notably, Harvey’s models hallucinate at a lower rate than the foundation models despite providing longer and more detailed answers.

Defining Hallucinations

Although hallucinations are commonly discussed, there does not appear to be a universally accepted definition for the term. At Harvey, we define a hallucination as a factual claim made by an LLM that can be demonstrably disproven by reference to a source of truth. We do not count as hallucinations related failure modes, such as errors in understanding or reasoning, that may also lead models to produce irrelevant or unhelpful information; those failure modes are measured in other ways in BigLaw Bench and in our internal evaluations.

The most common way to reduce factual hallucinations is through retrieval-augmented generation (RAG), which grounds model answers in document(s) or other sources of data that provide factually accurate and up-to-date information. However, even when models are provided with source of truth documents or datasets, hallucinations are still possible. Tuning model systems to mitigate these document-based hallucinations is a critical research effort at Harvey.
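To make the grounding idea concrete, here is a minimal sketch of the generic RAG pattern: retrieve candidate source-of-truth passages, then instruct the model to answer only from them. The keyword-overlap retriever, prompt wording, and function names are illustrative assumptions for exposition, not Harvey's retrieval system.

```python
# Minimal sketch of the generic RAG pattern described above. The retriever,
# scoring, and prompt wording are illustrative placeholders, not Harvey's system.

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank source-of-truth documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())

    def score(doc: str) -> int:
        return len(query_terms & set(doc.lower().split()))

    return sorted(documents, key=score, reverse=True)[:k]


def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the retrieved
    passages, citing them, rather than relying on parametric memory."""
    passages = retrieve(query, documents)
    context = "\n\n".join(f"[Source {i + 1}]\n{doc}" for i, doc in enumerate(passages))
    return (
        "Answer the question using only the sources below. "
        "Cite a source for each factual claim, and say so if the sources "
        "do not contain the answer.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# The grounded prompt is then sent to the LLM in place of the bare question.
prompt = build_grounded_prompt(
    "What is the termination notice period?",
    ["The agreement may be terminated on 60 days' written notice.",
     "Fees are payable quarterly in advance."],
)
```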

Measuring Hallucinations

The first step to reducing hallucinations is identifying when a model answer contains hallucinated content. To do so, we deploy a system of models that perform two main tasks. First, models break down an answer into all of its relevant factual claims. Second, models consider whether each factual claim made in the answer is true based on the information in the source of truth documents. This system is optimized by our research teams to confirm that model judgments are closely aligned with human hallucination review.
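A schematic sketch of that two-stage structure (claim decomposition, then per-claim verification against the source documents) is shown below. The prompts, data shapes, and the generic `llm` callable are illustrative assumptions, not Harvey's production prompts, models, or alignment procedure.

```python
# Schematic sketch of a two-stage hallucination check: (1) decompose an answer
# into atomic factual claims, (2) verify each claim against the source-of-truth
# documents. Prompts and the `llm` callable are illustrative assumptions only.
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # any text-in / text-out model interface


@dataclass
class ClaimJudgment:
    claim: str
    supported: bool  # False means the claim is contradicted or unsupported


def extract_claims(answer: str, llm: LLM) -> list[str]:
    """Stage 1: break the answer into individually checkable factual claims."""
    prompt = (
        "List every discrete factual claim in the answer below, one per line.\n\n"
        f"Answer:\n{answer}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def verify_claim(claim: str, sources: list[str], llm: LLM) -> ClaimJudgment:
    """Stage 2: judge whether a claim is supported by the grounding documents."""
    prompt = (
        "Sources:\n" + "\n\n".join(sources) +
        f"\n\nClaim: {claim}\nIs this claim supported by the sources? Answer YES or NO."
    )
    verdict = llm(prompt).strip().upper()
    return ClaimJudgment(claim=claim, supported=verdict.startswith("YES"))


def detect_hallucinations(answer: str, sources: list[str], llm: LLM) -> list[ClaimJudgment]:
    """Flag claims in the answer that the judge model cannot ground in the sources."""
    return [verify_claim(claim, sources, llm) for claim in extract_claims(answer, llm)]
```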

We deploy this system on the answers to a subset of BigLaw Bench tasks that require reasoning over multiple long documents; these are the cases we have found to be most vulnerable to hallucinations even when models are provided with grounding information. We then perform human review of all model judgments to confirm system alignment.

After human review, we compute the hallucination rate for each model. Hallucination rates are computed as the number of sentences containing a hallucinated claim divided by the total number of sentences in a response. The table below shows total response sentences and hallucination rates for Harvey and leading foundation models on the evaluated tasks.

Model               Total Response Sentences    Hallucination Rate
Harvey Assistant    1688                        0.2%
Claude              1140                        0.7%
ChatGPT             1176                        1.3%
Gemini              1067                        1.9%
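For concreteness, the snippet below applies the rate formula described above. It is a minimal illustration; the flagged-sentence count of 3 is hypothetical, and only the 1688-sentence total comes from the table.

```python
# Worked illustration of the metric: hallucination rate = sentences containing
# a hallucinated claim / total response sentences. The flagged count of 3 is
# hypothetical; only the 1688-sentence total is taken from the table above.

def hallucination_rate(flagged_sentences: int, total_sentences: int) -> float:
    """Fraction of response sentences that contain at least one hallucinated claim."""
    return flagged_sentences / total_sentences


# If, say, 3 of Harvey Assistant's 1688 response sentences were flagged,
# the rate would be about 0.18%, consistent with the reported 0.2%.
print(f"{hallucination_rate(3, 1688):.2%}")  # 0.18%
```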

Correcting Hallucinations

The next step beyond detecting and minimizing hallucinations is agents that can proactively correct hallucinations in their own workflows. Currently, catching anything beyond the most obvious or significant hallucinations is too slow for copilot-style interactions, with models spending many minutes checking and correcting work that took seconds to generate. But as model work becomes more substantial and comprehensive, the value of these self-review processes increases substantially. Agents performing more complicated tasks will be able to check their own work: conducting further research, eliminating hallucinations, and raising true ambiguities to experts to confirm understanding. These well-aligned agents will create more human interaction at the limits of hard problems, stating what they don’t know and asking for clarification or additional context rather than attempting to fill in understanding gaps.
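As a rough illustration of what such a self-review loop could look like, the sketch below drafts an answer, checks it with the hallucination detector sketched earlier, and either revises or asks for clarification. The control flow and prompts are hypothetical, and it reuses the `build_grounded_prompt`, `detect_hallucinations`, and `LLM` helpers from the earlier sketches.

```python
# Illustrative sketch of an agent self-review loop: draft, check the draft for
# hallucinations, then revise or escalate. Function names and control flow are
# hypothetical; helpers come from the earlier sketches in this post.

def self_reviewing_answer(question: str, sources: list[str], llm: LLM, max_rounds: int = 3) -> str:
    draft = llm(build_grounded_prompt(question, sources))
    for _ in range(max_rounds):
        judgments = detect_hallucinations(draft, sources, llm)
        unsupported = [j.claim for j in judgments if not j.supported]
        if not unsupported:
            return draft  # every claim is grounded in the sources
        # Ask the model to fix or remove unsupported claims, or to surface a
        # genuine ambiguity for an expert instead of papering over it.
        draft = llm(
            "Revise the answer so that every claim is supported by the sources. "
            "If a point cannot be resolved from the sources, say so explicitly "
            "and ask for clarification instead of guessing.\n\n"
            "Unsupported claims:\n- " + "\n- ".join(unsupported) +
            f"\n\nAnswer:\n{draft}"
        )
    return draft
```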

Next Steps

At Harvey, we are committed to building agents that deeply understand the information they know and clearly recognize the information they don’t. These well-grounded agents are essential to supporting experts on complex workflows where auditable factuality is critical. In the near future, we plan to deploy systems that detect and correct hallucinations in our more complex workflows. We also plan to make model reasoning on these workflows easier for users to trace, providing a combination of citations and model reasoning that establishes the validity of hallucination-checked analyses. These systems are the first steps toward building trust that complex LLM reasoning is hallucination-free and reliable for the most sophisticated legal work.
