
Harvey partners with Voyage to build custom legal embeddings

Announcing a custom embeddings model with Voyage AI, fine-tuned on case law.

Jul 22, 2024


Harvey Team


Tengyu Ma

Intro

Retrieval-augmented generation (RAG) is a fundamental component of real-world LLM systems, and a tool we often use to augment our custom models with specialized context. Embeddings are the backbone of RAG, enabling retrieval of items by their semantic meaning and complementing classical search strategies like keyword search. The challenge with standard embedding models, like standard language models, is that they are trained on general corpora and therefore struggle in specialized fields. For example, when compared against the entire universe of text, legal jargon all looks relatively similar, which prevents embedding-based retrieval from distinguishing the relevant passages from the rest of the data.
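To make the idea concrete, here is a minimal sketch of embedding-based retrieval combined with keyword search. The `embed` function is a hypothetical stand-in for a real embedding model (in practice, an API call to a trained model); here it is a toy bag-of-words vectorizer for illustration only, and `alpha` is an assumed blending weight.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical stand-in: real systems call a trained embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear in the document.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[str]:
    # Blend semantic similarity with classical keyword matching.
    qv = embed(query)
    scored = [
        (alpha * cosine(qv, embed(d)) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)]
```

In a production system, the embedding model carries most of the weight, which is exactly why a domain-tuned model matters: if legal texts all embed near one another, the semantic score stops discriminating.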

Voyage AI

Voyage AI, led by Stanford professor Tengyu Ma, is a leading developer of customized embedding models and LLM retrieval infrastructure. Voyage has assembled a world-class AI research team that has developed novel techniques enabling embeddings to capture the nuances of specialized text the way domain experts do.

Given their track record of building domain-specific embedding models, we were excited to partner with the Voyage AI team to fine-tune embeddings specifically for Harvey use cases.

Custom Embeddings

Together with Voyage, we fine-tuned an embedding model on US case law: more than 20 billion tokens of legal text, where even the best standard embedding models struggle to distinguish cases relevant to common questions. Starting from voyage-law-2 as a base, our model was trained first on the raw case law text itself, using Voyage AI's proprietary self-supervised techniques, and subsequently on a dataset of exemplar questions and expert annotations of relevant cases collected by our legal research team.
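The supervised stage described above, pairing exemplar questions with expert-annotated relevant cases, is commonly implemented with a contrastive objective. The sketch below shows a generic InfoNCE-style loss that pulls a query's embedding toward its annotated relevant case and away from other cases in the batch; this is a standard illustration of the technique, not Voyage AI's proprietary training method, and the `temperature` value is an assumed default.

```python
import math

def info_nce_loss(query_vec, pos_vec, neg_vecs, temperature=0.07):
    """Contrastive loss: -log softmax of the positive pair's similarity.

    query_vec: embedding of the exemplar question.
    pos_vec:   embedding of the expert-annotated relevant case.
    neg_vecs:  embeddings of other (non-relevant) cases in the batch.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity logits, positive pair first.
    logits = [dot(query_vec, pos_vec) / temperature] + [
        dot(query_vec, n) / temperature for n in neg_vecs
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

Minimizing this loss over many (question, relevant case, other cases) triples is what teaches the model to separate legally relevant material that a general-purpose embedding would place close together.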

Voyage’s custom training work has been immediately impactful. We evaluated our model and other leading embedding models on our Harvey legal retrieval task — a large dataset of query-content pairs generated from a variety of legal documents — and used Normalized Discounted Cumulative Gain (NDCG@10) and Recall at 100 items (Recall@100) as performance metrics (both are standard metrics for retrieval quality). Our custom embedding model, named voyage-law-2-harvey, reduces the amount of irrelevant material returned in top results by nearly 25% compared to the next best off-the-shelf embedding models (e.g. Google’s text-embedding-004 or OpenAI’s text-embedding-3-large). It is able to accomplish this with 1/3 of the embedding dimensionality, leading to significant benefits in storage and latency. We have also combined voyage-law-2-harvey’s more robust understanding of legal text with other proprietary search methods, further improving the ability of our retrieval systems to identify relevant cases and passages from cases in response to complex legal questions.

Next Steps

We are excited to continue working with Tengyu and Voyage to develop a suite of custom embeddings models for legal (and beyond), as well as work with our clients to create firm/company-specific embeddings for enterprise search, RAG systems, and other GenAI applications.

Credits: Aravind Srinivasan[1], Calvin Qi[1], Wen Phan[2], Daniel Hunter[1], Julio Pereyra[1], Niko Grupen[1], Tengyu Ma[2], Gabriel Pereyra[1]

[1] Harvey

[2] Voyage AI
