
Harvey is building legal agents and workflows with OpenAI o1

Today, OpenAI announced its newest series of reasoning models: OpenAI o1. The new model offers substantial improvements on Harvey’s leading indicators of model performance: expert preference and BigLaw Bench. It has also shown game-changing results in both our internal evaluations and collaborative model training projects. We’re thrilled to start building the next generation of legal agents and workflows with this new series of models.

The most exciting thing about the OpenAI o1 models is their ability to unlock advanced AI agents that can operate within more complex model systems. As model and system complexity increases, however, domain alignment becomes both more critical and more difficult to achieve. In this post, we share how Harvey evaluates new models like OpenAI o1 and how we will leverage these model improvements to further deliver on our promise of domain-specialized AI.

Evaluating OpenAI o1

A brand new foundation model presents a massive surface area for evaluation. Since the alpha release of GPT-4, Harvey has worked to develop and refine several evaluation methods to identify the most impactful way to use new models. 

Our typical evaluation focuses on two major categories: objective benchmarks and expert preference. Objective benchmarks are evaluative tasks reducible to a set of numbers that capture absolute model performance. The core set of these tasks makes up our BigLaw Bench, complemented by auxiliary benchmarks for systems like citation, extraction, and document understanding.
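
To make “reducible to a set of numbers” concrete, here is a minimal sketch of how a rubric-scored benchmark task could be represented and scored. The names and the checker function are illustrative assumptions, not Harvey’s actual BigLaw Bench implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    description: str  # e.g. "identifies the controlling statute" (hypothetical criterion)
    weight: float     # contribution of this criterion to the task score

@dataclass
class BenchmarkTask:
    prompt: str
    rubric: list[RubricItem]

def score_output(
    output: str,
    task: BenchmarkTask,
    satisfied: Callable[[str, RubricItem], bool],
) -> float:
    """Reduce one model output to a single number: the weighted share of rubric
    criteria it satisfies. `satisfied` stands in for whatever checker a given
    benchmark uses (string matching, extraction checks, or a grader model)."""
    total = sum(item.weight for item in task.rubric)
    earned = sum(item.weight for item in task.rubric if satisfied(output, item))
    return earned / total if total else 0.0
```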

Harvey also gathers expert preferences over model outputs from our legal research team—made up of lawyers from leading law firms. By ensuring our evaluations include human review, we can confidently capture qualitative preferences and nuanced assessments of legal judgment and style that are hard to express in purely objective benchmarks.
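
As a rough illustration of how human review can still feed quantitative comparison (again an assumption about tooling, not a description of Harvey’s process), blind pairwise expert judgments can be aggregated into a per-model win rate:

```python
from collections import Counter

def preference_win_rate(judgments: list[str]) -> dict[str, float]:
    """Aggregate blind pairwise expert judgments ("a", "b", or "tie")
    into a win rate for each model, counting ties as half a win."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return {"a": 0.0, "b": 0.0}
    return {
        "a": (counts["a"] + 0.5 * counts["tie"]) / total,
        "b": (counts["b"] + 0.5 * counts["tie"]) / total,
    }

# preference_win_rate(["a", "a", "tie", "b"]) -> {"a": 0.625, "b": 0.375}
```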

On both of these benchmarks, without further training or customization, OpenAI o1 impresses: it outperforms existing foundation models, though it still falls short of Harvey’s domain-specialized assistant models.

OpenAI o1’s performance on evaluations of agentic capabilities including reasoning, planning, and collaboration is—by far—the thing we are most excited about.

For models where we anticipate novel performance, Harvey supplements these conventional evaluations with “step-change” review: a set of evaluations designed to surface net-new capabilities. Harvey’s step-change review emphasizes the capabilities that would enable the next generation of professional service agents: reasoning, planning, and collaboration. OpenAI o1’s performance on these evaluations is—by far—the thing we are most excited about.

From Copilots to Coworkers

Over the last two years, Harvey has spent countless hours pushing the efficacy of frontier models on legal and professional service tasks. This work has given us perspective not only on where the models are now, but on what would need to change for them to reach the next level. In short, the next generation of models (from a performance perspective) will be those capable of being fully realized participants in professional service projects—capable of working seamlessly with experts to iteratively solve complex problems.

A few things limited the prior generation of models from fully participating in this iterative problem solving. First, they lacked the ability to extract intent from users by asking incisive questions that refine an appropriate solution to hard or underspecified problems. Second, they lacked the ability to effectively plan multi-part work products and share those plans with users for constructive feedback. Together, these gaps prevented the models from consistently interacting productively with users on complex, evolving problems—limiting them to well-specified tasks.

The OpenAI o1 model shows promise in breaking through these limitations. It can predict and plan out multi-stage tasks with more acumen and specificity than prior models. It also knows when and how to ask for help, surfacing relevant ambiguities or missing details to the user. Most importantly, it can reflect on and reason about the feedback it receives in both of these modes—making user input a high-leverage way to steer the model toward optimal outcomes on multi-step reasoning problems.
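
Structurally, that collaborate-plan-revise pattern can be sketched as a simple loop. The function arguments below are placeholders for model calls and user-facing hooks, not a real API:

```python
from typing import Callable

def collaborate(
    task: str,
    draft_plan: Callable[[str], str],                        # model proposes a multi-step plan
    find_ambiguities: Callable[[str, str], list[str]],       # model decides what it still needs to ask
    ask_user: Callable[[str], str],                          # routes a question to the expert
    revise_plan: Callable[[str, list[str], list[str]], str], # model reasons over the feedback
    max_rounds: int = 3,
) -> str:
    """Illustrative loop: the model plans, asks for help where the task is
    underspecified, and folds the expert's answers back into its plan."""
    plan = draft_plan(task)
    for _ in range(max_rounds):
        questions = find_ambiguities(task, plan)
        if not questions:          # nothing left to clarify
            break
        answers = [ask_user(q) for q in questions]
        plan = revise_plan(plan, questions, answers)
    return plan
```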

Developing legal agency at Harvey

While OpenAI o1 shows glimpses of this skillset, its ability to translate these talents to legal and other professional service domains remains nascent. Although it collaborates, plans, and reasons exceptionally well, it does not yet do these things with respect to legal work and processes. Fortunately, closing this domain-specific reasoning gap is exactly what the Harvey team has spent two years solving. 

In particular, OpenAI o1’s reasoning capabilities are optimized for largely deterministic problems, like those presented in coding and math. Law and professional services, in contrast, generally present problems deeply rooted in context—where most reasoning requires considering multiple sides of a problem and making judgment calls that draw on subjective considerations. A more generalist model can struggle to plan and reason in these settings, thinking linearly and myopically about solutions, which can lead it even further astray than prior, less thoughtful models.

To solve this problem, Harvey’s legal and ML research teams work closely together, both in-house and with partners at OpenAI, to identify ways to align models to these domain-specific reasoning problems. From identifying and curating relevant datasets to generating novel forms of human data, these teams work across disciplines to ensure models think about and solve problems the same way lawyers do.

Harvey’s legal and ML research teams work to align new model capabilities to domain-specific reasoning problems by supplementing foundation models with legal and process data.

Producing foundation models that, supplemented with the right legal and process data, perform far beyond their original baselines has always been a central research project at Harvey. As new model capabilities emerge, ensuring that those capabilities are aligned to domain-specific tasks becomes an even more important and impactful endeavor. With new ways to improve agentic model behaviors, Harvey is positioned to build a uniquely effective partner in the professional services space. We believe this more profound and effective collaboration between human and AI holds the key to unlocking even more valuable applications in professional services.

What’s next at Harvey

OpenAI o1’s substantial strides in conventional LLM functionality—including fluency, detailed writing, and better reasoning—combined with its novel potential to collaborate on hard problems set the tone for the next generation of Harvey solutions. In particular, we believe Harvey’s next generation of agents will unlock a set of workflows and use cases where collaboration is essential to producing a valid work product.

For example, “I want to draft an S-1” is a deeply underspecified LLM prompt unless the model can co-develop and co-execute a plan with the user in an iterative and interactive manner. An S-1 is a lengthy and complex legal document that companies must file with the SEC when planning an IPO. Drafting it involves detailed disclosures about the company’s business operations, financial health, risk factors, and management structure, all while complying with strict regulatory standards. The process requires collaboration over several months among legal teams, financial advisors, auditors, and executives.  

An advanced AI agent would start by asking detailed questions about the company preparing to go public, such as its business model, financial performance, and management structure. It would then assist in gathering and organizing the necessary information, such as the risk factors, financial statements, and disclosures required by the SEC. By collaboratively developing a comprehensive outline and ensuring compliance with regulatory standards, the agent helps produce a tailored and accurate S-1 registration statement. This iterative process involves ongoing collaboration with legal teams, executives, and other stakeholders. Building systems that can take in these kinds of requests, solicit the required information, and bring to the interaction the context and understanding needed to leverage that information is critical to evolving the way AI is used for professional services.
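
As a purely illustrative sketch, the shared artifact such an agent and working group might iterate over could be a structured filing plan. The section names and fields below are assumptions for illustration, not a description of Harvey’s systems or of SEC requirements:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SectionPlan:
    title: str                                    # e.g. "Risk Factors" (illustrative)
    required_inputs: list[str]                    # facts the agent must solicit or locate
    open_questions: list[str] = field(default_factory=list)
    draft: Optional[str] = None                   # filled in as drafting progresses

@dataclass
class FilingPlan:
    company: str
    sections: list[SectionPlan] = field(default_factory=list)

    def next_questions(self) -> list[str]:
        """Questions the agent should put to the working group before drafting."""
        return [q for section in self.sections for q in section.open_questions]
```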

The next generation of models, alongside agentic systems and UIs, will allow Harvey to build collaborative agents that work with legal experts on their hardest problems.

Developing both the model capabilities and the platform frameworks to actually engage in these interactions is the next frontier. OpenAI o1’s novel reasoning capabilities make solving these problems possible, but they are only one piece of the puzzle. Building truly seamless collaboration around these legal problems requires not only specializing OpenAI o1 but also building dozens of other model systems to complement it, and then packaging those systems in effective, intuitive user interfaces that go far beyond existing LLM interaction mediums like chat or prompt-based generation.

At Harvey, building the systems that enable domain-specialized models to facilitate the world’s most complex legal work has always been our mission. The next generation of models will allow us to further that mission with collaborative agents built to work with legal experts on their hardest problems.