If you prefer to follow along in a Python notebook, you can find it here.
The Test
We want to test the blog outline generation capabilities of OpenAI's gpt-3.5-turbo and Google's gemini-pro, two models with similar pricing and benchmarks. We will rely on user feedback metrics to pick a winner.
Setting this up requires us to:
- Create prompts for the 2 models
- Write the config for a 50-50 test
- Make requests using this config
- Send feedback for responses
- Find the winner
1. Create prompts for the 2 models
Portkey makes it easy to create prompts through the playground. We'll start by clicking Create on the Prompts tab and create the first prompt for OpenAI's gpt-3.5-turbo. You'll notice that I'd already created virtual keys for OpenAI and Google in my account. You can create them by going to the Virtual Keys tab and adding your API keys to Portkey's vault - this also ensures that your original API keys remain secure.
The prompt uses two variables, title and num_sections, which we'll populate through the API later on.
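For reference, a hypothetical template for this prompt could look something like the following, with the variables referenced using Portkey's {{variable}} syntax (the exact wording of your prompt is up to you):

```
System: You are a helpful assistant that writes well-structured blog outlines.
User: Write an outline for a blog post titled "{{title}}" with {{num_sections}} sections.
```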

Next, we repeat this for gemini-pro. It doesn't use a system prompt, so we can ignore that field and create a prompt like this.

2. Write the config for a 50-50 test
To run the experiment, let's create a config in Portkey that can automatically route requests between these 2 prompts. We pulled the id for both these prompts from our Prompts list page and will use them in our config. This is what it finally looks like.
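Here's a sketch of such a config, assuming a load-balancing strategy with equal weights. The prompt IDs below are placeholders for the IDs copied from the Prompts page; check Portkey's config schema for the exact field names.

```json
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "prompt_id": "pp-gpt-outline-xxxxx",
      "weight": 0.5
    },
    {
      "prompt_id": "pp-gemini-outline-xxxxx",
      "weight": 0.5
    }
  ]
}
```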

Create the config and fetch the ID
3. Make requests using this config
Let's use this config to start making requests from our application. We will use the prompt completions API to make the requests and add the config in our headers.
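A minimal sketch of such a request using Portkey's Python SDK (portkey_ai); the API key, config ID, prompt ID, and variable values are all placeholders:

```python
from portkey_ai import Portkey

# Initialise the client with the config created above, so requests
# are routed 50-50 between the two prompts.
portkey = Portkey(
    api_key="YOUR_PORTKEY_API_KEY",
    config="pc-blog-ab-test",  # placeholder config ID
)

# Call the prompt completions API with the variables defined in the prompts.
pcompletion = portkey.prompts.completions.create(
    prompt_id="pp-gpt-outline-xxxxx",  # placeholder; we assume the load-balancing config picks the actual prompt
    variables={
        "title": "Ten tips for writing great blog outlines",
        "num_sections": "5",
    },
)

print(pcompletion.choices[0].message.content)
```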
4. Send feedback for responses
Collecting and analysing feedback allows us to find the real performance of each of these 2 prompts (and in turn of gemini-pro and gpt-3.5-turbo).
The Portkey SDK provides a feedback method to collect feedback based on trace IDs. The pcompletion object from the previous request allows us to fetch the trace ID that Portkey created for it.
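A sketch of how this could look, assuming the trace ID is exposed through the response headers returned by the SDK; the header name and rating scale are assumptions here:

```python
# Fetch the trace ID Portkey attached to the earlier request
# (assuming the SDK exposes response headers via get_headers()).
trace_id = pcompletion.get_headers().get("x-portkey-trace-id")

# Attach the user's rating (e.g. a 1-5 score collected in our app) to that trace.
portkey.feedback.create(
    trace_id=trace_id,
    value=5,
)
```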
5. Find the winner
We can now compare the feedback for the 2 prompts from our feedback dashboard. The gpt-3.5-turbo prompt is at 4.71 average feedback after 20 attempts, while gemini-pro is at 4.11. While we definitely need more data and examples, let's assume for now that we want to start directing more traffic to the gpt-3.5-turbo prompt.
We can edit the weight in the config to direct more traffic to gpt-3.5-turbo. The new config would look like this:
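For example, shifting to an 80/20 split in favour of the gpt-3.5-turbo prompt (same placeholder prompt IDs as before; the exact weights are up to you):

```json
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "prompt_id": "pp-gpt-outline-xxxxx",
      "weight": 0.8
    },
    {
      "prompt_id": "pp-gemini-outline-xxxxx",
      "weight": 0.2
    }
  ]
}
```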
Next Steps
As next explorations, we could create new versions of these prompts and test between them. We could also test 2 different prompts on gpt-3.5-turbo to judge which one performs better.
Try creating a prompt to generate tweets and see which model or prompt performs better.
Portkey allows a lot of flexibility while experimenting with prompts.