A/B testing with large language models in production is crucial for driving optimal performance and user satisfaction.
In this example, we'll test OpenAI's `gpt-3.5-turbo` model against Google's `gemini-pro` model; the two have similar pricing and benchmark performance. We will rely on user feedback metrics to pick a winner.
Setting it up will need us to create a prompt for each model, each taking two variables, `title` and `num_sections`, which we'll populate through the API later on.
The `gemini-pro` model doesn't support a `system` prompt, so we can ignore it and create a prompt like this:
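Here's a rough sketch of that prompt template; the wording is illustrative, and `{{title}}` and `{{num_sections}}` use Portkey's mustache-style variable syntax:

```json
[
  {
    "role": "user",
    "content": "Write an outline for a blog post titled \"{{title}}\". Structure it into exactly {{num_sections}} sections."
  }
]
```

Create an equivalent prompt for each of the two models.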
We can grab the `id` for both these prompts from our Prompts list page and use them in our config. This is what the config finally looks like:
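Assuming two hypothetical prompt IDs (`pp-gpt-35-turbo-xxxxxx` and `pp-gemini-pro-xxxxxx`), a loadbalance config that splits traffic evenly between the two prompts would look roughly like this:

```json
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    { "prompt_id": "pp-gpt-35-turbo-xxxxxx", "weight": 0.5 },
    { "prompt_id": "pp-gemini-pro-xxxxxx", "weight": 0.5 }
  ]
}
```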
Create the config in the Portkey UI and fetch its ID; we'll use this ID in our completion requests.
We can now make completion calls using this config, and Portkey will route each request to one of the two prompts (and so to either `gemini-pro` or `gpt-3.5-turbo`) based on the configured weights.
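Here's a sketch of such a request using the Portkey Python SDK. The API key, config ID, and prompt ID below are placeholders, and the blog title is just an example:

```python
from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",   # placeholder: your Portkey API key
    config="pc-blog-ab-test"     # placeholder: the config ID fetched above
)

# Pass either prompt's ID here; the loadbalance config
# decides which of the two prompts serves each request.
pcompletion = portkey.prompts.completions.create(
    prompt_id="pp-gpt-35-turbo-xxxxxx",
    variables={
        "title": "Should colleges permit the use of AI in assignments?",
        "num_sections": "5"
    }
)

print(pcompletion.choices[0].message.content)
```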
The Portkey SDK provides a `feedback` method to collect feedback against trace IDs. The `pcompletion` object from the previous request lets us fetch the trace ID that Portkey created for it.
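Continuing the sketch above, collecting a user rating could look like this. The exact header key returned by `get_headers()` is an assumption; verify it against your SDK version:

```python
# The trace ID is exposed on the response headers
# (key name assumed here; check get_headers() output).
trace_id = pcompletion.get_headers()["trace-id"]

# Log the user's rating for this completion, e.g. a 5 on a 1-5 scale.
feedback = portkey.feedback.create(
    trace_id=trace_id,
    value=5
)
```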
The `gpt-3.5-turbo` prompt is at a 4.71 average feedback score after 20 attempts, while `gemini-pro` is at 4.11. We definitely need more data and examples, but let's assume for now that we want to start directing more traffic to the `gpt-3.5-turbo` prompt.
We can edit the `weight` values in the config to direct more traffic to `gpt-3.5-turbo`. The new config would look like this:
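For instance, with the same placeholder prompt IDs as before and an illustrative 80/20 split:

```json
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    { "prompt_id": "pp-gpt-35-turbo-xxxxxx", "weight": 0.8 },
    { "prompt_id": "pp-gemini-pro-xxxxxx", "weight": 0.2 }
  ]
}
```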
We compared two models here, but the same setup works for comparing two prompt variants on a single model, say two different prompts on `gpt-3.5-turbo`, to judge which one would perform better.
Try creating a prompt that generates tweets and see which model or prompt performs better.
Portkey allows a lot of flexibility when experimenting with prompts.