A/B Test Prompts and Models
A/B testing with large language models in production is crucial for driving optimal performance and user satisfaction. It helps you find and settle on the best model for your application and use case.
This cookbook will guide you through setting up an effective A/B test where we measure the performance of 2 different prompts, written for 2 different models, in production.
If you prefer to follow along in a Python notebook, you can find it here.
The Test
We want to test the blog outline generation capabilities of OpenAI's gpt-3.5-turbo model and Google's gemini-pro model, which have similar pricing and benchmark performance. We will rely on user feedback metrics to pick a winner.
Setting this up requires us to:
- Create prompts for the 2 models
- Write the config for a 50-50 test
- Make requests using this config
- Send feedback for responses
- Find the winner
Let’s get started.
1. Create prompts for the 2 models
Portkey makes it easy to create prompts through the playground.
We’ll start by clicking Create on the Prompts tab and creating the first prompt for OpenAI’s gpt-3.5-turbo.
You’ll notice that I’d already created virtual keys for OpenAI and Google in my account. You can create them by going to the Virtual Keys tab and adding your API keys to Portkey’s vault - this also ensures that your original API keys remain secure.
Let’s start with a simple prompt. We can always improve it iteratively. You’ll notice that we’ve added variables to it for title and num_sections, which we’ll populate through the API later on.
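For reference, the prompt template could look something like this (the exact wording here is illustrative, not the cookbook's original; Portkey templates variables with double curly braces):

```
System: You are a helpful writing assistant that creates well-structured blog outlines.

User: Write a blog outline for a post titled "{{title}}" with {{num_sections}} sections.
```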
Great, this is set up and ready now.
The gemini model doesn’t need a system prompt, so we can omit it and create a prompt like this.
2. Write the config for a 50-50 test
To run the experiment, let’s create a config in Portkey that can automatically route requests between these 2 prompts.
We pulled the IDs of both prompts from our Prompts list page and will use them in our config. This is what it finally looks like.
We’ve created a load balanced config that will route 50% of the traffic to each of the 2 prompt IDs mentioned in it. We can save this config and fetch its ID.
Create the config and fetch the ID
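A sketch of such a load-balanced config, assuming Portkey's config schema with a loadbalance strategy and weighted targets (the prompt IDs below are placeholders, not real ones):

```json
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "prompt_id": "pp-blog-outline-gpt", "weight": 0.5 },
    { "prompt_id": "pp-blog-outline-gemini", "weight": 0.5 }
  ]
}
```

The weights sum to 1, so each target receives its share of traffic directly as a percentage.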
3. Make requests using this config
Let’s use this config to start making requests from our application. We will use the prompt completions API to make the requests and add the config in our headers.
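A minimal sketch of such a request with the Portkey Python SDK. The config and prompt IDs here are hypothetical placeholders; the API call is guarded so the snippet only hits the network when a key is configured:

```python
import os

try:
    from portkey_ai import Portkey  # pip install portkey-ai
except ImportError:  # lets the sketch run even without the SDK installed
    Portkey = None

# Hypothetical IDs -- substitute the ones from your Portkey dashboard.
CONFIG_ID = "pc-blog-ab-test"      # the config we saved in step 2
PROMPT_ID = "pp-blog-outline-gpt"  # either prompt's ID; the attached config handles routing

# The variables we defined in the prompt templates earlier.
variables = {"title": "10 Tips for Better Sleep", "num_sections": "4"}

api_key = os.environ.get("PORTKEY_API_KEY")
if Portkey and api_key:
    client = Portkey(api_key=api_key, config=CONFIG_ID)
    pcompletion = client.prompts.completions.create(
        prompt_id=PROMPT_ID,
        variables=variables,
    )
    print(pcompletion.choices[0].message.content)
```

Because the config is attached at the client level, every completion made through it participates in the 50-50 split.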
As we make these requests, they’ll show up in the Logs tab. We can see that requests are being routed equally between the 2 prompts.
Let’s set up feedback for these APIs so we can begin our test!
4. Send feedback for responses
Collecting and analysing feedback allows us to find the real performance of each of these 2 prompts (and, in turn, of gemini-pro and gpt-3.5-turbo).
The Portkey SDK provides a feedback method to collect feedback against trace IDs. The pcompletion object from the previous request allows us to fetch the trace ID that Portkey created for it.
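One way to sketch this, assuming the SDK's feedback method and supplying our own trace ID at request time so it's easy to reference later (the config and prompt IDs remain hypothetical placeholders):

```python
import os
import uuid

try:
    from portkey_ai import Portkey  # pip install portkey-ai
except ImportError:
    Portkey = None

# Supply our own trace ID up front so we can attach feedback to it later.
trace_id = f"outline-{uuid.uuid4()}"

api_key = os.environ.get("PORTKEY_API_KEY")
if Portkey and api_key:
    client = Portkey(api_key=api_key, config="pc-blog-ab-test", trace_id=trace_id)
    pcompletion = client.prompts.completions.create(
        prompt_id="pp-blog-outline-gpt",  # hypothetical prompt ID
        variables={"title": "10 Tips for Better Sleep", "num_sections": "4"},
    )
    # Record the user's rating (here a 1-5 score) against the same trace.
    client.feedback.create(trace_id=trace_id, value=5)
```

Since both the request and the feedback share one trace ID, the score shows up against whichever prompt the load balancer actually picked for that request.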
5. Find the winner
We can now compare the feedback for the 2 prompts from our feedback dashboard.
We find that the gpt-3.5-turbo prompt has an average feedback score of 4.71 after 20 attempts, while gemini-pro is at 4.11. While we definitely need more data and examples, let’s assume for now that we want to start directing more traffic to the gpt-3.5-turbo prompt.
We can edit the weight values in the config to direct more traffic to gpt-3.5-turbo. The new config would look like this:
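A sketch of the reweighted config, under the same assumed schema and placeholder prompt IDs as before:

```json
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "prompt_id": "pp-blog-outline-gpt", "weight": 0.8 },
    { "prompt_id": "pp-blog-outline-gemini", "weight": 0.2 }
  ]
}
```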
This directs 80% of the traffic to OpenAI.
And we’re done! We were able to set up an effective A/B test between prompts and models without fretting.
Next Steps
As next explorations, we could create versions of the prompts and test between them. We could also test 2 prompts on gpt-3.5-turbo to judge which one performs better.
Try creating a prompt to create tweets and see which model or prompts perform better.
Portkey allows a lot of flexibility while experimenting with prompts.
Bonus: Add a fallback
We’ve noticed that we hit OpenAI rate limits at times. In that case, we can fall back to the gemini prompt so the user doesn’t experience the failure.
Adjust the config like this, and your fallback is set up!
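A sketch of the adjusted config, assuming Portkey's fallback strategy, which tries the targets in order (placeholder prompt IDs as before):

```json
{
  "strategy": { "mode": "fallback" },
  "targets": [
    { "prompt_id": "pp-blog-outline-gpt" },
    { "prompt_id": "pp-blog-outline-gemini" }
  ]
}
```

With this in place, a rate-limited OpenAI request is retried against the gemini prompt instead of surfacing an error to the user.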
If you need any help in further customizing this flow, or just have more questions as you run experiments with prompts / models, please reach out to us at [email protected] (We reply fast!)