OpenAI’s Prompt Caching: A Deep Dive

OpenAI's Prompt Caching is welcome news for developers who have been grappling with the challenges of managing API costs and response times. It introduces a mechanism to reuse recently seen input tokens, cutting the cost of those cached tokens by 50% and dramatically reducing latency for repetitive prompts.

In this post, we'll dive deep into OpenAI's new Prompt Caching feature, explore its synergies with Portkey's caching system, and provide you with actionable insights on how to optimize your caching strategies. Whether you're a seasoned AI developer or just starting to explore the possibilities of large language models, understanding these new dynamics will be crucial for staying ahead.

💡
Refer to our OpenAI Prompt Caching documentation for the user guide.

OpenAI’s Prompt Caching

OpenAI has introduced Prompt Caching for its latest and most advanced models:

  1. gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest)
  2. gpt-4o-mini
  3. o1-preview
  4. o1-mini

Prompt Caching is also available for fine-tuned versions of these models.

Caching is enabled automatically for prompts that are 1,024 tokens or longer. Beyond the initial 1,024 tokens, the system caches in 128-token increments. When you make an API request, the following steps occur:

  • Cache Lookup: The system checks if the initial portion (prefix) of your prompt is stored in the cache.
  • Cache Hit: If a matching prefix is found, the system uses the cached result. This significantly decreases latency and reduces costs.
  • Cache Miss: If no matching prefix is found, the system processes your full prompt. After processing, the prefix of your prompt is cached for future requests.

Cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.
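To make this concrete, here is a minimal sketch of the flow, assuming the official openai Python SDK (a recent release that exposes prompt_tokens_details) and a placeholder LONG_STATIC_INSTRUCTIONS string long enough to cross the 1,024-token threshold:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical static prefix; it must reach the 1,024-token minimum for caching
# to kick in (a real system prompt, policies, few-shot examples, etc.).
LONG_STATIC_INSTRUCTIONS = "You are a support assistant for Acme Corp. ..."

def ask(question: str):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": LONG_STATIC_INSTRUCTIONS},  # static prefix
            {"role": "user", "content": question},                    # dynamic suffix
        ],
    )

first = ask("How do I reset my password?")
second = ask("How do I change my billing address?")

# If the shared prefix was cached by the first call, the second response
# reports how many prompt tokens were served from the cache.
print(second.usage.prompt_tokens_details.cached_tokens)
```

The first call is a cache miss and stores the prefix; the second call, sent within the cache lifetime, should show a non-zero cached_tokens count.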

Types of Cacheable Content

OpenAI's caching system is versatile and can cache several types of content (see the sketch after this list):

  • Messages: Entire message arrays, including system, user, and assistant interactions.
  • Images: Both linked and base64-encoded images within user messages.
  • Tool Use: Messages array and available tools list.
  • Structured Outputs: Schemas serving as prefixes to system messages.
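As a hedged illustration of the tool-use case, the sketch below (openai Python SDK again, with a made-up get_weather tool) keeps the tool definitions and system message identical across calls so that, once the combined prompt crosses the 1,024-token minimum, they form a reusable cached prefix:

```python
from openai import OpenAI

client = OpenAI()

# Static part of every request: the tool definitions (hypothetical example tool).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,  # identical across requests, so part of the cacheable prefix
    messages=[
        {"role": "system", "content": "You are a weather assistant."},  # static
        {"role": "user", "content": "What's the weather in Paris?"},    # varies
    ],
)
```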

OpenAI offers a 50% discount on cached input tokens compared to uncached input tokens. For prompts dominated by a reusable prefix, this can come close to halving input costs, which helps developers running large-scale applications or making frequent, similar queries.

Pricing Table for Different Models

All prices below are per 1 million tokens.

| Model | Uncached Input Tokens | Cached Input Tokens | Output Tokens |
|---|---|---|---|
| GPT-4o | $2.50 | $1.25 | $10.00 |
| GPT-4o (fine-tuned) | $3.75 | $1.875 | $15.00 |
| GPT-4o mini | $0.15 | $0.075 | $0.60 |
| GPT-4o mini (fine-tuned) | $0.30 | $0.15 | $1.20 |
| o1-preview | $15.00 | $7.50 | $60.00 |
| o1-mini | $3.00 | $1.50 | $12.00 |
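As a quick worked example at the GPT-4o rates: a request with a 2,006-token prompt, of which 1,920 tokens are served from the cache (the same figures as the sample response below), costs roughly 86 × $2.50/1M + 1,920 × $1.25/1M ≈ $0.0026 in input tokens, versus about $0.0050 with no cache hit, a saving of close to 48% on input cost for that request.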

Monitoring Cache Usage

1. cached_tokens Value in API Response

Developers can monitor cache usage through the cached_tokens value in the API response:

{
  "usage": {
    "total_tokens": 2306,
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "prompt_tokens_details": {
      "cached_tokens": 1920,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0
    }
  }
}
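In the Python SDK the same numbers are available as attributes on the response's usage object. A small sketch, assuming a recent openai release that populates prompt_tokens_details:

```python
def report_cache_usage(response) -> float:
    """Return the fraction of prompt tokens served from OpenAI's cache."""
    usage = response.usage  # response from client.chat.completions.create(...)
    cached = usage.prompt_tokens_details.cached_tokens
    hit_ratio = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
    print(f"cached {cached} of {usage.prompt_tokens} prompt tokens ({hit_ratio:.0%})")
    return hit_ratio
```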

Best Practices for OpenAI's Prompt Caching

Optimizing Prompt Structure

  1. Front-load Static Content: Place unchanging parts at the beginning of your prompt to increase cache hit likelihood (see the sketch after this list).
  2. Standardize Templates: Use consistent structures for similar tasks to promote reusability.
  3. Structured Inputs: Utilize JSON or similar formats for consistent data presentation.
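A minimal sketch of points 1–3, assuming a hypothetical build_messages helper: the static instructions and JSON-serialized reference data go first, byte-identical on every call, and only the per-request question is appended at the end:

```python
import json

# Static content, identical across requests, so it forms the cacheable prefix.
STATIC_INSTRUCTIONS = "You are a contract-review assistant. Apply the policies below..."
REFERENCE_POLICIES = {"liability": "...", "termination": "..."}  # placeholder data

def build_messages(question: str) -> list[dict]:
    """Front-load static content; keep the part that changes last."""
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        # sort_keys keeps the serialized JSON identical from request to request
        {"role": "system", "content": json.dumps(REFERENCE_POLICIES, sort_keys=True)},
        {"role": "user", "content": question},  # dynamic suffix, varies per request
    ]
```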

Maximizing Cache Hits

  1. Batch Processing: Group similar requests to increase consecutive cache hits (see the sketch after this list).
  2. Token Threshold Awareness: Aim for prompts exceeding 1,024 tokens for caching benefits.
  3. Request Timing: Consider the 5-10 minute cache lifetime when scheduling batch processes.
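One simple way to combine points 1 and 3 is to sort pending work so that requests sharing a static prefix run back to back, inside the 5-10 minute cache window. A sketch with a hypothetical queue of (template_id, question) jobs:

```python
from itertools import groupby

# Hypothetical queue of pending requests, keyed by which prompt template they use.
jobs = [
    ("contract_review", "Is clause 4 enforceable?"),
    ("support_faq", "How do I reset my password?"),
    ("contract_review", "Summarize the liability terms."),
]

# Group by template so requests sharing a static prefix run consecutively,
# keeping that prefix warm in the cache between calls.
jobs.sort(key=lambda job: job[0])
for template_id, batch in groupby(jobs, key=lambda job: job[0]):
    for _, question in batch:
        ...  # send the request built from this template's static prefix + question
```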

Performance Monitoring and Optimization

  1. Track Metrics: Monitor cache hit rates using the cached_tokens value in API responses (a tracker is sketched after this list).
  2. Analyze and Adjust: Regularly review cache misses and adjust prompt structures accordingly.
  3. Continuous Improvement: Implement A/B testing for prompt structures and conduct periodic audits of common API calls.
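As a starting point for point 1, here is a small running tracker; it is only a sketch and relies solely on the usage fields shown earlier:

```python
class CacheStats:
    """Accumulates cached vs. total prompt tokens across many requests."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, usage) -> None:
        """Call with response.usage after each chat completion."""
        self.prompt_tokens += usage.prompt_tokens
        self.cached_tokens += usage.prompt_tokens_details.cached_tokens

    @property
    def hit_rate(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0
```

Record every response, then log hit_rate periodically, or keep one tracker per prompt variant to compare structures in an A/B test.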

Portkey User Guide: How OpenAI's New Caching Affects You

As a Portkey user, you might be wondering how OpenAI's new Prompt Caching feature impacts your experience. Here's what you need to know:

Potential Benefits

  1. Broader Coverage: Portkey's caching works across a wide range of providers and models, not just OpenAI.
  2. Possible Cost Savings: OpenAI's 50% discount on cached input tokens might lead to reduced costs for certain requests, especially for longer prompts.
  3. Improved Performance: You may notice even faster response times for some queries, as caching occurs both at OpenAI's level and through Portkey.
  4. Enhanced Caching for Specific Models: If you frequently use GPT-4o, GPT-4o mini, o1-preview, or o1-mini, you might see improved caching performance for prompts over 1,024 tokens.

What Stays the Same

  1. Semantic Caching: Portkey's semantic caching matches new requests to cached ones by contextual similarity, with no specific prompt structure required.
  2. Longer Cache Duration: Portkey's ability to cache responses for up to 90 days remains a significant advantage over OpenAI's 1-hour maximum.
  3. Cache Control: You retain the ability to use Portkey's cache namespacing and force refresh features for fine-grained control over your cached data.

Visit Portkey’s Caching Documentation for more details.

What You Might Notice

  1. Varied Performance Improvements: Depending on your usage patterns, you might see more consistent performance improvements, especially for repeated, lengthy prompts on supported OpenAI models.
  2. Potential Changes in Analytics: Your cache hit rates and cost savings in Portkey's analytics might change as some caching now occurs at OpenAI's level.
  3. Model-Specific Differences: You might notice more significant improvements when using the specific OpenAI models that support their new caching feature.

Best Practices

  1. Monitor Your Usage: Keep an eye on your Portkey analytics to understand how the new caching system affects your specific use case.
  2. Optimize Prompt Structure: For supported OpenAI models, consider structuring prompts with static content at the beginning to maximize caching benefits.
  3. Adjust Cache Settings: Experiment with Portkey's cache settings to find the optimal balance between the two caching systems for your specific needs.

Optimizing Caching Strategies: OpenAI + Portkey

  1. Use OpenAI's built-in prefix caching for structured, repetitive queries; within its short cache window it serves the repeated prefix at half the input-token cost.
  2. When OpenAI's cache misses, fall back to Portkey's semantic cache, which matches similar (not just identical) requests, persists far longer, and works with other models as well (a rough sketch follows below).
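A rough sketch of that setup with Portkey's Python SDK; the config shape (cache mode and max_age) is an assumption based on Portkey's caching documentation, so check there for the exact keys before copying it:

```python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",         # placeholder
    virtual_key="OPENAI_VIRTUAL_KEY",  # placeholder key for your OpenAI provider
    # Assumed config shape: enable Portkey's semantic cache with a 1-hour TTL.
    config={"cache": {"mode": "semantic", "max_age": 3600}},
)

# Requests flow through Portkey, so they can benefit from both layers:
# OpenAI's automatic prefix caching and Portkey's semantic cache.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
```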

The introduction of OpenAI's native caching, combined with Portkey's existing capabilities, opens up a new frontier in AI application development. As a developer, you now have unprecedented tools at your disposal to create faster, more efficient, and more cost-effective AI-powered applications.