For your AI app, rate limits can be a hard constraint: once you start hitting a provider's rate limits, there is nothing you can do except wait for the cooldown and try again. Portkey helps you solve this easily. This cookbook shows how to use Portkey to distribute traffic across multiple LLMs and how to make your load balancer robust by setting up backups for requests. You will learn how to load balance across OpenAI and Anthropic, leveraging Anthropic's Claude 3 models, with Azure OpenAI serving as the fallback layer.

Prerequisites: You need a Portkey API key (sign up to obtain one), and you should have added the OpenAI, Azure OpenAI, and Anthropic providers to Model Catalog.
1. Import the SDK and authenticate Portkey
Start by adding the portkey-ai package to your Node.js project.
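A minimal setup sketch; the apiKey value below is a placeholder to replace with your own key:

```javascript
// Install the SDK in your Node.js project first:
//   npm install portkey-ai

import Portkey from 'portkey-ai';

// Authenticate with your Portkey API key (placeholder value shown).
const portkey = new Portkey({
  apiKey: 'PORTKEY_API_KEY',
});
```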
2. Create Configs: Loadbalance with Nested Fallbacks
Portkey acts as an AI gateway for all of your requests to LLMs. It follows the OpenAI SDK signature in all of its methods and interfaces, making it easy to use and to switch between providers. The config below applies a loadbalance strategy across Anthropic and OpenAI: weight describes how the traffic should be split, 50/50 between the two providers here, while override_params lets us override the default parameters for each target.
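A config along these lines expresses the 50/50 split; the provider slugs and model names below are illustrative placeholders, not values prescribed by this cookbook:

```javascript
// Loadbalance config: split traffic 50/50 between OpenAI and Anthropic.
const loadbalanceConfig = {
  strategy: { mode: 'loadbalance' },
  targets: [
    {
      provider: 'openai',                    // illustrative provider slug
      weight: 0.5,                           // 50% of traffic
      override_params: { model: 'gpt-4o' },  // illustrative model
    },
    {
      provider: 'anthropic',
      weight: 0.5,                           // 50% of traffic
      override_params: {
        model: 'claude-3-opus-20240229',     // illustrative model
        max_tokens: 1024,                    // Anthropic requires max_tokens
      },
    },
  ],
};
```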
Let's take this a step further and apply a fallback mechanism so that requests to OpenAI fall back to Azure OpenAI. Nesting a fallback among the loadbalance targets ensures our app stays reliable in production with greater confidence.
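One way to sketch the nested config, again with illustrative slugs and model names:

```javascript
// Loadbalance across Anthropic and OpenAI; within the OpenAI slice,
// fall back to Azure OpenAI if the OpenAI request fails.
const config = {
  strategy: { mode: 'loadbalance' },
  targets: [
    {
      provider: 'anthropic',                 // illustrative slug
      weight: 0.5,
      override_params: { model: 'claude-3-opus-20240229', max_tokens: 1024 },
    },
    {
      weight: 0.5,
      strategy: { mode: 'fallback' },        // nested fallback strategy
      targets: [
        { provider: 'openai', override_params: { model: 'gpt-4o' } },
        { provider: 'azure-openai', override_params: { model: 'gpt-4o' } },
      ],
    },
  ],
};
```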
See the documentation for Portkey Fallbacks and Loadbalancing.
3. Make a Request
Now that the configs are concrete and passed as arguments when instantiating the Portkey client instance, all subsequent requests acquire the desired behavior automatically; no additional changes to the codebase are needed.
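A sketch of what the request could look like, assuming the client is instantiated with the nested config from step 2 (the API key, slugs, and model names are placeholders):

```javascript
import Portkey from 'portkey-ai';

// Pass the config at instantiation; every request made through this
// client then inherits the loadbalance-with-fallback behavior.
const portkey = new Portkey({
  apiKey: 'PORTKEY_API_KEY',                 // placeholder
  config: {
    strategy: { mode: 'loadbalance' },
    targets: [
      { provider: 'anthropic', weight: 0.5,
        override_params: { model: 'claude-3-opus-20240229', max_tokens: 1024 } },
      { weight: 0.5, strategy: { mode: 'fallback' },
        targets: [{ provider: 'openai' }, { provider: 'azure-openai' }] },
    ],
  },
});

const response = await portkey.chat.completions.create({
  messages: [{ role: 'user', content: 'Say hello!' }],
  model: 'gpt-4o',                           // overridden per target
});
console.log(response.choices[0].message.content);
```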
4. Trace the request from the logs
Identifying a particular request among the thousands received every day can feel like finding a needle in a haystack. Portkey solves this by letting us attach a trace ID of our choice to each request; here we use request-loadbalance-fallback.
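A hedged sketch, assuming the SDK forwards a traceID option as the x-portkey-trace-id header (check the SDK reference for the exact option name):

```javascript
import Portkey from 'portkey-ai';

// Attach a trace ID so this request is easy to locate in Portkey logs.
const portkey = new Portkey({
  apiKey: 'PORTKEY_API_KEY',                 // placeholder
  traceID: 'request-loadbalance-fallback',   // shows up in the logs view
});
```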
5. Advanced: Canary Testing
New models arrive every day, and your app is in production. What is the best way to try out the quality of those models? Canary testing allows you to gradually roll out a change to a small subset of users before making it available to everyone. Consider this scenario: you have been using OpenAI as your LLM provider for a while, but would like to try an open-source Llama model for your app through Anyscale. The weight field indicates how traffic is split, so that 10% of your user base is served from Anyscale's Llama models. You are now set up to gather feedback, observe the performance of your app, and release to an increasingly larger user base.
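A canary split can be sketched as a loadbalance config with a 90/10 weight distribution (the Anyscale slug and Llama model name are illustrative):

```javascript
// Canary config: 90% of traffic stays on OpenAI, 10% is routed to an
// open-source Llama model served through Anyscale.
const canaryConfig = {
  strategy: { mode: 'loadbalance' },
  targets: [
    {
      provider: 'openai',
      weight: 0.9,                           // existing user base
      override_params: { model: 'gpt-4o' },
    },
    {
      provider: 'anyscale',
      weight: 0.1,                           // canary slice
      override_params: { model: 'meta-llama/Llama-2-70b-chat-hf' },
    },
  ],
};
```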
Considerations
You can implement production-grade loadbalancing and nested fallback mechanisms with just a few lines of code. While you are now equipped with all the tools for your next GenAI app, here are a few considerations:

- Every request has to adhere to the LLM provider's requirements for it to be successful. For instance, max_tokens is required for Anthropic but not for OpenAI.
- While loadbalancing helps reduce the load on any one LLM, it is recommended to pair it with a fallback strategy to ensure that your app stays reliable.
- On Portkey, you can also set a loadbalance weight to 0; this effectively stops routing requests to that target, and you can raise it again when required.
- Loadbalancing has no limit on the number of targets, so you can add multiple account details from one provider and effectively multiply your available rate limits.
- Loadbalancing does not alter the outputs or the latency of the requests in any way.
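For example, setting a target's weight to 0 pauses it without removing it from the config (a sketch with placeholder slugs):

```javascript
// Pause the second target by setting its weight to 0; all traffic now
// goes to the first target until the weight is raised again.
const pausedConfig = {
  strategy: { mode: 'loadbalance' },
  targets: [
    { provider: 'openai', weight: 1 },
    { provider: 'anthropic', weight: 0 },    // paused, not removed
  ],
};
```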
See the entire code

