The Gateway Grew Up

So did the problems it was built to solve.

There's a point where something stops being a side project and becomes infrastructure. The thing you were "just trying out" is now what your business runs on. The question shifts from "can we build with AI?" to "can we make sure it never breaks?"

Most AI failures in production aren't hallucinations. They're embarrassingly basic. A provider goes down and nobody notices for three hours. A new agent ships and blows past the entire GenAI budget with zero visibility. Engineers send proprietary data to external models because no guardrail exists. These aren't AI problems. They're infrastructure problems.

That's the gap Gateway was built for: a single, reliable layer that every AI request flows through. Universal API. Fallbacks. Load balancing. Retries. Conditional routing. Guardrails. 250+ LLMs, unified. The basics, done right, at scale.
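As a concrete sketch of what "every request flows through one layer" means: a client keeps a normal chat-completions call and attaches a routing config that tells the gateway to fail over between providers. The field names below follow Portkey's documented config schema, but treat the exact keys and the localhost:8787 default port as assumptions, not a definitive reference.

```python
import json

def build_fallback_config(primary: str, backups: list[str]) -> dict:
    """Try the primary provider first; fail over to backups in order."""
    return {
        "strategy": {"mode": "fallback"},
        "targets": [{"provider": p} for p in [primary, *backups]],
    }

config = build_fallback_config("openai", ["anthropic", "azure-openai"])

# The config travels with each request to the self-hosted gateway, e.g.:
# requests.post(
#     "http://localhost:8787/v1/chat/completions",   # assumed default port
#     headers={"x-portkey-config": json.dumps(config)},
#     json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]},
# )
```

The point of the pattern: fallback order, retries, and routing live in config that rides along with the request, so application code never hardcodes a provider.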

24,000+ organizations started running production AI on it. $180M+ in AI spend managed. 1T tokens processed every day.

Today, we're open sourcing Portkey’s Production Gateway.

It used to be simple...

For most of AI's production history, the pattern was simple: user sends a prompt, model returns a response. You log it, cache it, move on.

Agents broke this.

An agent doesn't call a model once. It calls tools. It calls other agents. It loops, plans, executes code. And increasingly, those tools live on MCP servers — microservice-like things that let models interact with your Slack, your database, your internal APIs.

This introduced a whole new surface area of failure. Who authenticated that MCP server? Which version is it running? If your agent calls a tool that hasn't been updated in six months but your model's behavior changed last week, what breaks? And where in the chain do you even look?

Teams scaling agentic workflows hit the same wall: they needed a registry to manage MCP servers, an auth layer for MCP traffic, circuit breakers to isolate failures, and metrics that don't go stale as new models ship. None of it existed as one thing.

So we built it.

What's new in the OSS Production Gateway?

Everything that used to require a SaaS subscription - circuit breakers, semantic caching, budget limits, the model catalog, metadata governance, config management - is now in the open source Gateway. No license keys. No upgrade prompts.

Here's what moved, and what's new on top of it:

  • MCP Registry: Track, version, and manage MCP servers in one place. Deprecated endpoints fail clearly, not silently.
  • OAuth 2.1 for MCP: PKCE flow. No more hardcoded API keys in environment variables.
  • Circuit Breakers: Configurable on P99 latency or error rate. Probe requests test recovery before traffic resumes. Fails over to cache, an alternate provider, or a clean error.
  • Usage Policies: Limits at the request, token, or cost level, enforced at ingress before the model is ever called.
  • Model Catalog: Every model, every provider, pricing included. Stays current so you don't have to chase provider docs.
  • Metrics: Real-time cost, latency, and usage - updated as new models and pricing land.
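The circuit-breaker behavior described above - trip open on a high error rate, then let a probe through after a cooldown to test recovery - can be sketched in a few dozen lines. This is an illustrative minimal version, not the gateway's actual implementation (which also tracks P99 latency):

```python
import time

class CircuitBreaker:
    """Trip open when the recent error rate crosses a threshold; after a
    cooldown, admit probe traffic to test whether the provider recovered."""

    def __init__(self, error_threshold=0.5, window=20, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.window = window          # number of recent calls to consider
        self.cooldown_s = cooldown_s
        self.results = []             # True = success, False = failure
        self.opened_at = None         # time the breaker tripped, or None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True               # closed: traffic flows normally
        # Open: only admit traffic (a probe) once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, success: bool) -> None:
        if self.opened_at is not None:
            if success:               # probe succeeded: close the breaker
                self.opened_at = None
                self.results = []
            else:                     # probe failed: restart the cooldown
                self.opened_at = time.monotonic()
            return
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= self.window and \
                failures / len(self.results) >= self.error_threshold:
            self.opened_at = time.monotonic()
```

When `allow_request()` returns False, the caller takes the fallback path: serve from cache, route to an alternate provider, or return a clean error instead of hammering a failing upstream.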

For teams with serious security and compliance requirements, the On-Prem Enterprise Gateway adds gRPC, SSO, SCIM, AWS KMS, RBAC, JWT, audit logs, multi-workspace support, and SOC 2/GDPR/HIPAA compliance.
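For a sense of what the OAuth 2.1 PKCE flow (listed above for MCP auth) replaces hardcoded keys with, here is the standard code_verifier/code_challenge pair from RFC 7636, S256 method. This is generic PKCE, not the gateway's own client code:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for the S256 method."""
    # High-entropy, URL-safe verifier (43 chars from 32 random bytes).
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    # Challenge = base64url(SHA-256(verifier)), without padding.
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = make_pkce_pair()
# 1. Client sends `challenge` (and method=S256) in the authorization request.
# 2. Client sends `verifier` with the token request.
# 3. Server recomputes SHA-256(verifier) and checks it matches `challenge`.
```

Because the verifier never leaves the client until the token exchange, an intercepted authorization code is useless on its own - which is what makes this safer than a static API key sitting in an environment variable.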

All of it. Open source.

The agentic ecosystem is moving fast - a new model lands every few weeks, a new provider every month. We've been running this at scale for three years, processing billions of tokens across dozens of providers for hundreds of teams - and every failure mode we've hit is hardened into the codebase.

Fork it. Self-host it. Build on it. We'll see you in the issues :)