29 Nov
Summary

Teams from Springworks and Haptik shared hard-won insights from running LLMs in production: Gemini outperforms gpt-4o for Hinglish translation, and shifting to managed Gateways cuts latency in half. Plus practical tips on caching and RAG optimization at scale.

Notes

On Production Patterns

  • Haptik & Springworks map Portkey virtual keys to their model deployments, making it simple for engineers to prototype & build AI features (see the SDK sketch after this list)
  • Monitor Portkey analytics to understand deployment behavior and pre-scale resources to avoid rate limits
  • For secure testing, use short-lived virtual keys instead of sharing long-term access
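
As a rough illustration of the virtual-key pattern above, here is a minimal sketch using the Portkey Python SDK. The key name and model are placeholders, not the teams' actual setup.

```python
from portkey_ai import Portkey

# Each virtual key maps to one provider deployment configured in Portkey,
# so engineers switch deployments by changing the key, not the calling code.
client = Portkey(
    api_key="PORTKEY_API_KEY",   # your Portkey API key
    virtual_key="openai-prod",   # hypothetical virtual key for one deployment
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this ticket."}],
)
print(response.choices[0].message.content)
```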

Some Learnings

  • Infrastructure insight: Each additional middleware layer (auth, rate limiting) compounds latency at scale - consider using Gateway features directly instead of custom layers
  • Plan for caching early: Auxiliary services inevitably add latency at scale - implement caching in your initial development cycle (a gateway config sketch follows this list)
  • In RAG pipelines, vector DB operations become a bottleneck before LLM calls do - optimize these first (see the timing sketch below)
  • For Hinglish audio translations, especially with noise, Gemini proves more reliable than gpt-4o
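
To ground the caching and gateway points above: Portkey's gateway can cache responses via a request config, which keeps caching out of custom middleware layers. A minimal sketch, assuming a simple exact-match cache; the TTL and key names are illustrative, not the teams' production settings.

```python
from portkey_ai import Portkey

# Gateway-level cache config: repeated identical prompts are served from
# the gateway cache instead of triggering a fresh LLM call.
client = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="openai-prod",      # hypothetical virtual key
    config={
        "cache": {
            "mode": "simple",       # exact-match caching ("semantic" also exists)
            "max_age": 3600,        # TTL in seconds; illustrative value
        }
    },
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What plans do you offer?"}],
)
```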
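
For the RAG bottleneck point, it helps to measure before optimizing. A generic timing sketch; `vector_search` and `call_llm` are stand-ins for whatever retrieval and generation calls a pipeline actually uses.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

def answer(query, vector_search, call_llm):
    # Time each stage separately: at scale, retrieval often dominates
    # end-to-end latency before the LLM call does.
    docs = timed("vector search", vector_search, query, top_k=5)
    return timed("llm call", call_llm, query, docs)
```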