Prompt Caching
OpenAI now offers prompt caching, a feature that can significantly reduce both latency and costs for your API requests. It is particularly beneficial for prompts exceeding 1024 tokens, offering up to an 80% reduction in latency for prompts longer than 10,000 tokens.
Prompt caching is enabled for the following models:
- gpt-4o (excludes gpt-4o-2024-05-13)
- gpt-4o-mini
- o1-preview
- o1-mini
Portkey supports OpenAI’s prompt caching feature out of the box. Here is an example of how to use it:
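The sketch below uses the Portkey Python SDK (portkey_ai); the API key and virtual key values are placeholders, and the long system prompt stands in for a static prefix exceeding 1,024 tokens.

```python
# Minimal sketch: prompt caching happens automatically on OpenAI's side once
# the (unchanging) prompt prefix exceeds 1024 tokens; nothing extra is needed
# in the request itself.
from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",         # placeholder
    virtual_key="OPENAI_VIRTUAL_KEY",  # placeholder: virtual key pointing at OpenAI
)

# A long, static system prompt (>1024 tokens in practice) forms the cacheable
# prefix; only the trailing user message changes between requests.
LONG_SYSTEM_PROMPT = "You are a support assistant. <...long, unchanging instructions...>"

response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)

print(response.choices[0].message.content)
```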
What can be cached
- Messages: The complete messages array, encompassing system, user, and assistant interactions.
- Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically across requests, as it affects image tokenization.
- Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement (see the sketch after this list).
- Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
What’s Not Supported
- Completions API (only Chat Completions API is supported)
- Streaming responses (caching still works with streamed requests, but streaming behavior itself is unchanged)
Monitoring Cache Performance
Portkey logs prompt caching requests and responses, and calculates usage and costs based on OpenAI’s calculations:
All requests, including those with fewer than 1024 tokens, return a cached_tokens field within usage.prompt_tokens_details on the chat completion object, indicating how many of the prompt tokens were a cache hit. For requests under 1024 tokens, cached_tokens will be zero.
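The following sketch shows one way to read this field with the Portkey Python SDK; the keys are placeholders, and the attribute access assumes the usage block mirrors OpenAI's chat completion response.

```python
from portkey_ai import Portkey

portkey = Portkey(api_key="PORTKEY_API_KEY", virtual_key="OPENAI_VIRTUAL_KEY")  # placeholders

response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "<...long, unchanging instructions...>"},
        {"role": "user", "content": "Summarize the refund policy."},
    ],
)

# usage.prompt_tokens_details.cached_tokens reports how many prompt tokens were
# served from the cache; it is 0 for prompts under 1024 tokens.
usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"prompt tokens: {usage.prompt_tokens}, cached tokens: {cached}")
```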
Key Features:
- Reduced Latency: Especially significant for longer prompts.
- Lower Costs: Cached portions of prompts are billed at a discounted rate.
- Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
- Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.