A/B Test Prompts and Models

A/B testing with large language models in production is crucial for driving optimal performance and user satisfaction. It helps you find and settle on the best model for your application (and use-case).

This cookbook will guide us through setting up an effective A/B test where we measure the performance of 2 different prompts written for 2 different models in production.

If you prefer to follow along a python notebook, you can find that here.

The Test

We want to test the blog outline generation capabilities of OpenAI's gpt-3.5-turbo model and Google's gemini-pro models which have similar pricing and benchmarks. We will rely on user feedback metrics to pick a winner.

Setting it up will need us to

  1. Create prompts for the 2 models

  2. Write the config for a 50-50 test

  3. Make requests using this config

  4. Send feedback for responses

  5. Find the winner

Let's get started.

1. Create prompts for the 2 models

Portkey makes it easy to create prompts through the playground.

We'll start by clicking Create on the Prompts tab and create the first prompt for OpenAI's gpt-3.5-turbo.

You'll notice that I'd already created virtual keys for OpenAI and Google in my account. You can create them by going to the Virtual Keys tab and adding your API keys to Portkey's vault - this also ensures that your original API keys remain secure.

Let's start with a simple prompt. We can always improve it iteratively. You'll notice that we've added variables to it for title and num_sections which we'll populate through the API later on.

Great, this is setup and ready now.

The gemini model doesn't need a system prompt, so we can ignore it and create a prompt like this.

2. Write the config for a 50-50 test

To run the experiment, lets create a config in Portkey that can automatically route requests between these 2 prompts.

We pulled the id for both these prompts from our Prompts list page and will use them in our config. This is what it finally looks like.

{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [{
    "prompt_id": "0db0d89c-c1f6-44bc-a976-f92e24b39a19",
    "weight": 0.5
  },{
    "prompt_id": "pp-blog-outli-840877",
    "weight": 0.5
  }]
}

We've created a load balanced config that will route 50% of the traffic to each of the 2 prompt IDs mentioned in it. We can save this config and fetch its ID.

3. Make requests using this config

Lets use this config to start making requests from our application. We will use the prompt completions API to make the requests and add the config in our headers.

As we make these requests, they'll show up in the Logs tab. We can see that requests are being routed equally between the 2 prompts.

Let's setup feedback for these APIs so we can begin our tests!

4. Send feedback for responses

Collecting and analysing feedback allows us to find the real performance of each of these 2 prompts (an in turn gemini-pro and gpt-3.5-turbo)

The Portkey SDK allows a feedback method to collect feedback based on trace IDs. The pcompletion object in the previous request allows us to fetch the trace ID that portkey created for it.

// Get the trace ID of the request we just made
const reqTrace = pcompletion.getHeaders()["trace-id"]

await portkey.feedback.create({
    traceID: reqTrace,
    value: 1  // For thumbs up or 0 for thumbs down
});

5. Find the winner

We can now compare the feedback for the 2 prompts from our feedback dashboard

We find that the gpt-3.5-turbo prompt is at 4.71 average feedback after 20 attempts, while gemini-pro is at 4.11. While we definitely need more data and examples, let's assume for now that we wanted to start directing more traffic to it.

We can edit the weight in the config to direct more traffic to gpt-3.5-turbo. The new config would look like this:

{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [{
    "prompt_id": "0db0d89c-c1f6-44bc-a976-f92e24b39a19",
    "weight": 0.8
  },{
    "prompt_id": "pp-blog-outli-840877",
    "weight": 0.2
  }]
}

This directs 80% of the traffic to OpenAI.

And we're done! We were able to set up an effective A/B test between prompts and models without fretting.

Next Steps

As next explorations, we could create versions of the prompts and test between them. We could also test 2 prompts on gpt-3.5-turbo to judge which one would perform better.

Try creating a prompt to create tweets and see which model or prompts perform better.

Portkey allows a lot of flexibility while experimenting with prompts.

Bonus: Add a fallback

We've noticed that we hit the OpenAI rate limits at times. In that case, we can fallback to the gemini prompt so the user doesn't experience the failure.

Adjust the config like this, and your fallback is setup!

{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [{
    "strategy": {"mode": "fallback"},
    "targets": [
      {
        "prompt_id": "0db0d89c-c1f6-44bc-a976-f92e24b39a19",
        "weight": 0.8
      }, {
        "prompt_id": "pp-blog-outli-840877"
      }]
  },{
    "prompt_id": "pp-blog-outli-840877",
    "weight": 0.2
  }]
}

If you need any help in further customizing this flow, or just have more questions as you run experiments with prompts / models, please reach out to us at [email protected] (We reply fast!)

Last updated