Return Repeat Requests from Cache
If multiple users of your GenAI app trigger the same or similar queries, fetching a fresh LLM response every time is slow and expensive: each request means another round trip from your app to the model, and you end up paying for duplicate queries.
To avoid such unnecessary LLM requests, you can use Portkey as your first line of defense. It is highly effective and works across the 100+ LLMs Portkey supports with changes to just a few lines of code.
How Portkey Cache Works
All requests that have caching enabled will have their subsequent responses served from Portkey’s cache.
Portkey offers two caching techniques you can enable on your requests: Simple and Semantic.
In short:
- Simple caching serves a response from cache when the input prompt is exactly identical to a previous one.
- Semantic caching serves a response from cache when the input prompt is similar enough to a previous one, based on a cosine-similarity threshold (for example, “What is the capital of France?” and “Tell me France’s capital city” can hit the same cache entry).
For detailed information, check out this blog post.
1. Import and Authenticate Portkey Client SDK
You now have a brief mental map of Portkey’s approach to caching LLM responses.
Let’s use the Portkey Client SDK to send chat completion requests and attach the gateway configs that activate caching.
To install the SDK (published on npm as portkey-ai), run the following in your Node.js environment:
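```sh
npm install portkey-ai
```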
Instantiate the Portkey Client
At this point, it’s essential to understand that you instantiate the portkey instance with the apiKey and virtualKey parameters. You can find the values for both in your Portkey Dashboard.
Visit the reference to obtain the Portkey API key and learn how to create Virtual Keys.
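A minimal sketch, assuming the portkey-ai Node.js SDK and placeholder key values:

```js
import Portkey from "portkey-ai";

// Both values come from your Portkey Dashboard; the strings below are placeholders.
const portkey = new Portkey({
  apiKey: "PORTKEY_API_KEY",        // authenticates your requests to Portkey
  virtualKey: "OPENAI_VIRTUAL_KEY", // references the provider credentials stored in Portkey
});
```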
2. Use Gateway Configs to enable Caching
The AI gateway caches your requests and serves responses according to the gateway configs attached to the request headers. The configs are a simple JS object or JSON string made up of key-value pairs.
The mode key specifies the caching strategy you want for your app: simple or semantic.
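For example, a minimal cache config looks roughly like this (a sketch; check the exact schema against the Gateway Configs docs):

```js
// Simple caching: serve from cache only on exact prompt matches
const simpleCacheConfig = { cache: { mode: "simple" } };

// Semantic caching: serve from cache on sufficiently similar prompts
const semanticCacheConfig = { cache: { mode: "semantic" } };
```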
Next, attach these configs to the request using the Portkey SDK. The SDK accepts a config parameter that takes these configurations as an argument. To learn about more ways to apply configs, refer to the 101 on Gateway Configs.
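For example (a sketch, assuming the config option on the client constructor accepts a raw config object in addition to a saved config ID):

```js
// Attach a raw config object ...
const cachedClient = new Portkey({
  apiKey: "PORTKEY_API_KEY",
  virtualKey: "OPENAI_VIRTUAL_KEY",
  config: { cache: { mode: "simple" } }, // applies to every request from this client
});

// ... or reference a config saved in the Portkey UI by its ID (placeholder value below).
const cachedClientFromSavedConfig = new Portkey({
  apiKey: "PORTKEY_API_KEY",
  virtualKey: "OPENAI_VIRTUAL_KEY",
  config: "pc-cache-config-id", // hypothetical config ID from the dashboard
});
```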
3. Make API calls, Serve from Cache
We are now ready to put what we’ve learned so far into action. We will make requests to an OpenAI model (as an example), first with simple caching activated and then with semantic caching enabled.
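Here is a sketch of the simple-caching case, assuming the config is attached at client construction as above, and timing two identical calls so you can see the cache kick in:

```js
import Portkey from "portkey-ai";

// Gateway config with simple caching attached to every request from this client.
const portkey = new Portkey({
  apiKey: "PORTKEY_API_KEY",        // placeholder
  virtualKey: "OPENAI_VIRTUAL_KEY", // placeholder
  config: { cache: { mode: "simple" } },
});

async function ask(question) {
  const start = Date.now();
  const response = await portkey.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: question }],
  });
  console.log(`${Date.now() - start} ms:`, response.choices[0].message.content);
}

// The second, identical request should be served from Portkey's cache.
await ask("What are the top 5 happiest countries in the world?");
await ask("What are the top 5 happiest countries in the world?");
```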
For semantic caching, only the cache mode in the gateway config changes:
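A sketch along the same lines, assuming mode: "semantic" is the only change needed in the config:

```js
// Same setup as before, but with semantic caching in the gateway config.
const semanticPortkey = new Portkey({
  apiKey: "PORTKEY_API_KEY",
  virtualKey: "OPENAI_VIRTUAL_KEY",
  config: { cache: { mode: "semantic" } },
});

async function askSemantic(question) {
  const start = Date.now();
  const response = await semanticPortkey.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: question }],
  });
  console.log(`${Date.now() - start} ms:`, response.choices[0].message.content);
}

// The rephrased second prompt can still be served from cache if it is similar enough.
await askSemantic("List the five happiest countries in the world.");
await askSemantic("Which five countries rank as the happiest in the world?");
```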
On the console, compare the timings for each pair of calls: the request served from cache should come back in a fraction of the time taken by the first, uncached call.
Try experimenting with rephrasing the prompts in the messages array and see if you notice any difference in the time it takes to receive a response or in the quality of the response itself.
Can you refresh the cache on demand? Yes, you can!
Can you control how long the cache remains active? Absolutely!
Explore the docs on caching to learn about all the features available for controlling how LLM responses are cached.
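As a rough sketch of what those controls can look like (treat the max_age field and the cache-force-refresh header below as assumptions to confirm against the caching docs):

```js
// Assumption: max_age (in seconds) controls how long a cached response stays valid.
const config = {
  cache: {
    mode: "simple",
    max_age: 3600, // keep cached responses for an hour
  },
};

// Assumption: an on-demand refresh is requested per call with a header such as
// "x-portkey-cache-force-refresh": true (check the caching docs for the exact name).
```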
4. View Analytics and Logs
On the Analytics page, you can find Portkey’s cache performance analytics under the Cache tab.
The Logs page displays the list of LLM calls, and the corresponding cache icon is activated on every call whose response was served from cache.
Next steps
By leveraging simple and semantic caching, you can avoid unnecessary LLM requests, reduce latency, and provide a better user experience. So go ahead and experiment with the Portkey Cache in your own projects – the benefits are just a few lines of code away!
Some suggestions to experiment with:
- Try using the configs from the Portkey UI as a reference.
- Implement caching when there are multiple targets in your gateway configs. (Here’s a clue, and a sketch follows below.)
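For the second suggestion, here is a hedged sketch of a gateway config that combines caching with a fallback strategy across two targets (the virtual keys are placeholders; check the exact schema against the Gateway Configs docs):

```js
const config = {
  strategy: { mode: "fallback" }, // try the targets in order until one succeeds
  cache: { mode: "semantic", max_age: 3600 },
  targets: [
    { virtual_key: "OPENAI_VIRTUAL_KEY" },
    { virtual_key: "ANTHROPIC_VIRTUAL_KEY" },
  ],
};
```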