LLMs are costly to run. As usage grows, providers have to balance serving user requests against stretching their GPU capacity too thin. They generally deal with this by putting rate limits on how many requests a user can send per minute or per day.
For example, with the `tts-1-hd` model from OpenAI, you cannot send more than 7 requests in a minute; any extra request automatically fails.
There are many real-world use cases where it's easy to run into these limits. Here's a snapshot of the rate limits imposed by some popular LLM providers:
| LLM Provider | Example Model | Rate Limits |
|---|---|---|
| OpenAI | gpt-4 | Tier 1: 500 Requests per Minute, 10,000 Tokens per Minute, 10,000 Requests per Day |
| Anthropic | All models | Tier 1: 50 RPM, 50,000 TPM, 1 Million Tokens per Day |
| Cohere | Co.Generate models | Production key: 10,000 RPM |
| Anyscale | All models | Endpoints: 30 concurrent requests |
| Perplexity AI | mixtral-8x7b-instruct | 24 RPM, 16,000 TPM |
| Together AI | All models | Paid: 100 RPM |
One way to route around these limits is to automatically fall back to another provider whenever a request gets rate limited. Here's what that looks like in a `chat.completions` call using the Portkey SDK:
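The snippet below is a minimal sketch: the virtual key names, the Claude model name, and the `max_tokens` value are placeholders you'd swap for your own setup.

```python
from portkey_ai import Portkey

# Fallback config: send requests to OpenAI first, and only fall back to
# Anthropic when the request fails with a 429 (rate limit) status code.
config = {
    "strategy": {
        "mode": "fallback",
        "on_status_codes": [429],
    },
    "targets": [
        {"virtual_key": "openai-virtual-key"},
        {
            "virtual_key": "anthropic-virtual-key",
            # Extra params for the fallback provider: Anthropic needs
            # max_tokens, and we point it at a Claude model (placeholder).
            "override_params": {"model": "claude-2.1", "max_tokens": 1024},
        },
    ],
}

portkey = Portkey(api_key="PORTKEY_API_KEY", config=config)

response = portkey.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What are rate limits?"}],
)
print(response.choices[0].message.content)
```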
In this call:

- `strategy` is set as `fallback`
- The `on_status_codes` param ensures that the fallback is only triggered on the `429` error code, which is returned for rate limit errors
- The `targets` array contains the details of the LLMs and the order of the fallback
- `override_params` in the second target lets you add more params for the specific provider (`max_tokens` for Anthropic in this case)

Another option is to stay with a single provider but spread the traffic across multiple API keys from different accounts, so no single key hits its rate limit. A sketch of such a load-balanced config follows below.
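This is again a sketch, with placeholder virtual key names standing in for three OpenAI keys from three separate accounts:

```python
from portkey_ai import Portkey

# Load-balance config: three OpenAI virtual keys from three different
# accounts, each with equal weight, so traffic is split roughly 1/3 per key.
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {"virtual_key": "openai-key-account-1", "weight": 1},
        {"virtual_key": "openai-key-account-2", "weight": 1},
        {"virtual_key": "openai-key-account-3", "weight": 1},
    ],
}

# The request itself is the same as in the fallback example; only the
# config changes.
portkey = Portkey(api_key="PORTKEY_API_KEY", config=config)
response = portkey.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What are rate limits?"}],
)
```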
In this config:

- `strategy` is set as `loadbalance`
- `targets` contain 3 different OpenAI API keys from 3 different accounts, all with equal weight, which means Portkey will split the traffic equally (1/3rd each) among the 3 keys