The OpenAI Evals framework consists of:
- A framework to evaluate an LLM (large language model) or a system built on top of an LLM.
- An open-source registry of challenging evals.
This notebook will cover:
- Introduction to Evaluation and the OpenAI Evals library
- Building an Eval
- Running an Eval
What are evaluations / evals?
Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations ("evals") will mean a more stable, reliable application that is resilient to code and model changes. An eval is a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to determine the quality of the LLM system.
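As a concrete illustration of that loop, here is a minimal sketch: send an input prompt to a model, then compare the completion against an ideal answer. It uses the OpenAI Python client (v1+) directly rather than the Evals library; the model name, the `run_sample` helper, and the substring check are illustrative assumptions.

```python
# Minimal sketch of the eval loop described above (not the Evals library itself).
# Assumes the openai Python package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def run_sample(prompt: str, ideal: str, model: str = "gpt-4") -> bool:
    """Generate a completion for the prompt and grade it against the ideal answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    completion = response.choices[0].message.content or ""
    # Grade by checking whether the ideal answer appears in the completion.
    return ideal in completion

print(run_sample("What year was Obama elected president for the first time?", "2008"))
```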
Importance of Evaluations
If you are building with foundational models like GPT-4, creating high-quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time-intensive to understand how different model versions and prompts might affect your use case.
With OpenAI’s continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline to make sure you achieve the desired accuracy before deploying.
Types of evals
There are two main ways we can evaluate/grade completions: writing some validation logic in code or using the model itself to inspect the answer. We’ll introduce each with some examples.
Writing logic for answer checking
The simplest and most common type of eval has an input and an ideal response or answer. For example, we can have an eval sample where the input is "What year was Obama elected president for the first time?" and the ideal answer is "2008". We feed the input to a model and get the completion. If the model says "2008", it is then graded as correct. We can write a string match to check if the completion includes the phrase "2008". If it does, we consider it correct.
Consider another eval where the input asks the model to generate valid JSON: we can write some code that attempts to parse the completion as JSON and consider the completion correct if it parses.
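Both checks above can be written as small, deterministic grading functions. The sketch below is illustrative and independent of the Evals library; the function names are assumptions.

```python
import json

def grade_includes(completion: str, ideal: str) -> bool:
    """Correct if the completion contains the ideal answer as a substring."""
    return ideal in completion

def grade_valid_json(completion: str) -> bool:
    """Correct if the completion parses as JSON."""
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

assert grade_includes("Barack Obama was first elected in 2008.", "2008")
assert grade_valid_json('{"name": "example", "valid": true}')
```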
Model grading: A two-stage process where the model first answers the question, then we ask a model to look at the response to check if it's correct.
Consider an input that asks the model to write a funny joke. The model then generates a completion. We then create a new input to the model that includes the completion and asks: "Is the following joke funny? First reason step by step, then answer yes or no." We finally consider the original completion correct if the new model completion ends with "yes".
Model grading works best with the latest, most powerful models like GPT-4 and when we give them the ability to reason before making a judgment. Model grading will have an error rate, so it is important to validate its performance with human evaluation before running the evals at scale. For best results, it makes sense to use a different model for grading than the one that produced the completion, such as using GPT-4 to grade GPT-3.5 answers.
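The sketch below shows what such a two-stage check could look like for the joke example, using the OpenAI Python client (v1+) directly rather than the library's model-graded templates. The model names, the `chat` helper, and the grading prompt are assumptions chosen for illustration.

```python
# Illustrative two-stage model grading: a candidate model answers, then a
# stronger model judges the answer after reasoning step by step.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    """Return the text of a single chat completion."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

# Stage 1: the candidate model produces the completion being evaluated.
joke = chat("gpt-3.5-turbo", "Write a funny joke about programming.")

# Stage 2: a stronger model grades the completion, reasoning before judging.
grading_prompt = (
    "Is the following joke funny? First reason step by step, "
    f"then answer yes or no on the final line.\n\nJoke: {joke}"
)
verdict = chat("gpt-4", grading_prompt)

# Consider the original completion correct if the grader's answer ends with "yes".
is_correct = verdict.strip().lower().rstrip(".").endswith("yes")
print(joke, verdict, is_correct, sep="\n---\n")
```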
OpenAI Eval Templates
In using evals, we have discovered several "templates" that accommodate many different benchmarks. We have implemented these templates in the OpenAI Evals library to simplify the development of new evals. For example, we have defined two types of eval templates that can be used out of the box:
- Basic Eval Templates: These contain deterministic functions to compare the output to the ideal_answers. In cases where the desired model response has very little variation, such as answering multiple-choice questions or simple questions with a straightforward answer, we have found the following templates to be useful.
- Model-Graded Templates: These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy. In cases where the desired model response can contain significant variation, such as answering an open-ended question, we have found that using the model to grade itself is a viable strategy for automated evaluation.
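Both kinds of templates consume eval data as JSONL samples that pair an input prompt with an ideal answer. The sketch below writes a sample in the `input`/`ideal` shape described in the Evals repository; treat the exact schema and the file name as assumptions to confirm against the library's documentation.

```python
import json

# Sketch: writing an eval sample in the JSONL shape the basic templates expect
# (a chat-formatted "input" plus an "ideal" answer). Confirm the exact schema
# against the OpenAI Evals docs before relying on it.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the year only."},
            {"role": "user", "content": "What year was Obama elected president for the first time?"},
        ],
        "ideal": "2008",
    },
]

with open("first_election.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```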