Enterprise Feature
gRPC support is available on Enterprise self-hosted plans only. Contact the Portkey team to enable it for your gateway deployment.
gRPC support is currently in beta. The API surface may change based on feedback.
The Portkey Gateway supports gRPC as an alternative transport protocol alongside HTTP/REST. This enables lower latency, efficient binary serialization via Protocol Buffers, and native streaming support for applications that prefer gRPC communication.
How It Works
The gateway operates in two modes depending on the provider:
| Mode | Description | Use Case |
|---|---|---|
| gRPC → HTTP Proxy | Gateway accepts gRPC requests and converts them to HTTP internally | Works with all providers |
| Native gRPC | Gateway connects to the provider’s native gRPC endpoint directly | Lower latency for supported providers (e.g., Google Gemini) |
For providers without a native gRPC endpoint, the gateway transparently proxies gRPC requests over HTTP — so every provider supported by Portkey works out of the box. When a provider does expose a native gRPC API (currently Google Gemini), the gateway connects directly for optimal performance.
Starting the gRPC Server
Command Line Flags
# Start only the gRPC server (default port 8789)
npm start -- --llm-grpc
# Start both HTTP and gRPC servers
npm start -- --llm-node --llm-grpc
# With custom ports
npm start -- --llm-node --llm-grpc --port 8787 --grpc-port 50051
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GRPC_PORT | 8789 | Port for the gRPC server |
| PORT | 8787 | Port for the HTTP server (used as base URL for internal routing) |
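Since the table above documents the ports as environment variables, they can be set that way instead of via flags; a minimal sketch, assuming your shell exports them inline:
# Same effect as --port 8787 --grpc-port 50051
PORT=8787 GRPC_PORT=50051 npm start -- --llm-node --llm-grpc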
Enabling TLS
The gRPC server supports TLS using the same certificates as the HTTP server:
TLS_KEY_PATH=/path/to/key.pem \
TLS_CERT_PATH=/path/to/cert.pem \
npm start -- --llm-grpc
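Clients must then connect over TLS instead of plaintext. A minimal grpcio sketch, where the CA bundle path and hostname are placeholders for your environment:
import grpc

# Trust the CA that signed the server's cert.pem (path is an example).
with open('/path/to/ca.pem', 'rb') as f:
    ca_cert = f.read()

credentials = grpc.ssl_channel_credentials(root_certificates=ca_cert)
channel = grpc.secure_channel('your-gateway-host:8789', credentials)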
Authentication
Pass your Portkey API key as gRPC metadata:
| Metadata Key | Description |
|---|---|
| x-portkey-api-key | Your Portkey API key |
With the Model Catalog, the provider is specified in the model string itself (@provider_slug/model_name), so separate provider headers are typically not needed.
Making Requests
All request bodies are sent as a JSON string in the input field of the GatewayRequest message. Each endpoint returns responses in a consistent format matching the API you called, regardless of the underlying provider:
- ChatCompletions and Embeddings — OpenAI-compatible format
- Messages — Anthropic Messages format
- Responses — OpenAI Responses API format
The gateway handles all provider-to-format translation automatically — you always get the format matching the endpoint you called, no matter which LLM is behind it.
Portkey uses the Model Catalog format for model strings:
@provider_slug/model_name
Examples: @openai/gpt-4o, @gemini/gemini-2.0-flash, @anthropic/claude-3-opus-20240229, @azure-openai/gpt-4
Chat Completions
grpcurl
Python (grpcio)
Node.js (@grpc/grpc-js)
# Non-streaming
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@openai/gpt-4o\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
}' \
localhost:8789 gateway.Gateway/ChatCompletions
# Streaming
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@openai/gpt-4o\", \"messages\": [{\"role\": \"user\", \"content\": \"Count from 1 to 5\"}], \"stream\": true}"
}' \
localhost:8789 gateway.Gateway/ChatCompletionsStream
import grpc
import json
# Generate Python stubs from proto:
# python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. gateway.proto
import gateway_pb2
import gateway_pb2_grpc
channel = grpc.insecure_channel('localhost:8789')
stub = gateway_pb2_grpc.GatewayStub(channel)
metadata = [
('x-portkey-api-key', 'YOUR_PORTKEY_KEY'),
]
# Non-streaming
request = gateway_pb2.GatewayRequest(
input=json.dumps({
"model": "@openai/gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
})
)
response = stub.ChatCompletions(request, metadata=metadata)
print(f"Status: {response.status_code}")
print(f"Response: {response.body.decode('utf-8')}")
# Streaming
stream_request = gateway_pb2.GatewayRequest(
input=json.dumps({
"model": "@openai/gpt-4o",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": True
})
)
for chunk in stub.ChatCompletionsStream(stream_request, metadata=metadata):
print(f"Chunk: {chunk.data.decode('utf-8')}")
// Using CommonJS require() for compatibility.
// For ESM, use: import grpc from '@grpc/grpc-js';
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');
const packageDefinition = protoLoader.loadSync('gateway.proto', {
keepCase: true,
longs: String,
enums: String,
defaults: true,
oneofs: true,
});
const gatewayProto = grpc.loadPackageDefinition(packageDefinition).gateway;
const client = new gatewayProto.Gateway(
'localhost:8789',
grpc.credentials.createInsecure()
);
const metadata = new grpc.Metadata();
metadata.set('x-portkey-api-key', 'YOUR_PORTKEY_KEY');
// Non-streaming
client.ChatCompletions(
{
input: JSON.stringify({
model: '@openai/gpt-4o',
messages: [{ role: 'user', content: 'Hello!' }],
}),
},
metadata,
(error, response) => {
if (error) {
console.error('Error:', error);
return;
}
console.log('Status:', response.status_code);
console.log('Response:', response.body.toString('utf-8'));
}
);
// Streaming
const call = client.ChatCompletionsStream(
{
input: JSON.stringify({
model: '@openai/gpt-4o',
messages: [{ role: 'user', content: 'Count to 5' }],
stream: true,
}),
},
metadata
);
call.on('data', (chunk) => {
console.log('Chunk:', chunk.data.toString('utf-8'));
});
call.on('end', () => {
console.log('Stream ended');
});
call.on('error', (error) => {
console.error('Stream error:', error);
});
Anthropic Messages
grpcurl
grpcurl (Streaming)
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@anthropic/claude-3-opus-20240229\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"max_tokens\": 100}"
}' \
localhost:8789 gateway.Gateway/Messages
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@anthropic/claude-3-opus-20240229\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"max_tokens\": 100, \"stream\": true}"
}' \
localhost:8789 gateway.Gateway/MessagesStream
OpenAI Responses
grpcurl
grpcurl (Streaming)
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@openai/gpt-4o\", \"input\": \"Tell me a joke\"}"
}' \
localhost:8789 gateway.Gateway/Responses
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@openai/gpt-4o\", \"input\": \"Tell me a joke\", \"stream\": true}"
}' \
localhost:8789 gateway.Gateway/ResponsesStream
Embeddings
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@openai/text-embedding-3-small\", \"input\": \"The quick brown fox jumps over the lazy dog\"}"
}' \
localhost:8789 gateway.Gateway/Embeddings
Health Check
grpcurl -plaintext localhost:8789 gateway.Gateway/Health
{
"status": "success",
"message": "Server is healthy",
"version": "1.x.x"
}
Native gRPC Providers
When a provider exposes a native gRPC API, the gateway bypasses HTTP entirely and makes direct gRPC calls for the lowest possible latency.
| Provider | Transport | Streaming | Endpoint |
|---|---|---|---|
| Google Gemini | Native gRPC | Yes | generativelanguage.googleapis.com:443 |
| OpenAI | HTTP proxy | Yes | — |
| Anthropic | HTTP proxy | Yes | — |
| Azure OpenAI | HTTP proxy | Yes | — |
How Native gRPC Works
When a request targets a native gRPC provider, the gateway:
- Detects the gRPC transport via the x-portkey-gateway-transport: grpc header
- Transforms the request into the provider’s native gRPC format
- Makes a direct gRPC call to the provider’s endpoint
- Transforms the response back to the format matching the endpoint you called
This eliminates HTTP/JSON serialization overhead, uses efficient binary Protocol Buffer encoding, and maintains persistent gRPC connections with client caching.
Gemini (Non-Streaming)
Gemini (Streaming)
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@gemini/gemini-2.0-flash\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
}' \
localhost:8789 gateway.Gateway/ChatCompletions
grpcurl -plaintext \
-H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
-d '{
"input": "{\"model\": \"@gemini/gemini-2.0-flash\", \"messages\": [{\"role\": \"user\", \"content\": \"Explain quantum computing\"}], \"stream\": true}"
}' \
localhost:8789 gateway.Gateway/ChatCompletionsStream
The gateway handles all format transformations automatically — you send requests in the format matching the endpoint (Chat Completions, Messages, or Responses) and receive responses in that same format, regardless of the underlying provider.
gRPC Service Definition
The gateway exposes a single Gateway service with the following methods:
service Gateway {
// Health check
rpc Health(Empty) returns (HealthResponse);
// Embeddings (non-streaming)
rpc Embeddings(GatewayRequest) returns (GatewayResponse);
// Chat completions
rpc ChatCompletions(GatewayRequest) returns (GatewayResponse);
rpc ChatCompletionsStream(GatewayRequest) returns (stream StreamChunk);
// Anthropic Messages
rpc Messages(GatewayRequest) returns (GatewayResponse);
rpc MessagesStream(GatewayRequest) returns (stream StreamChunk);
// OpenAI Responses
rpc Responses(GatewayRequest) returns (GatewayResponse);
rpc ResponsesStream(GatewayRequest) returns (stream StreamChunk);
}
Message Types
message GatewayRequest {
string input = 1; // JSON request body
}
message GatewayResponse {
int32 status_code = 1;
bytes body = 2; // JSON response body
}
message StreamChunk {
bytes data = 1; // SSE-formatted chunk data
}
message HealthResponse {
string status = 1;
string message = 2;
string version = 3;
}
Service Discovery
The gRPC server supports reflection, enabling service discovery with tools like grpcurl:
# List all services
grpcurl -plaintext localhost:8789 list
# Describe the Gateway service
grpcurl -plaintext localhost:8789 describe gateway.Gateway
# Describe a specific method
grpcurl -plaintext localhost:8789 describe gateway.Gateway.ChatCompletions
Response Formats
The GatewayResponse contains an HTTP status code and a JSON body. The response format depends on which endpoint you called — the gateway ensures consistency regardless of the underlying provider.
ChatCompletions
Returns the standard OpenAI Chat Completions format:
{
"id": "portkey-xxx",
"object": "chat.completion",
"created": 1234567890,
"model": "gemini-2.0-flash",
"provider": "google",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 15,
"total_tokens": 25
}
}
Each StreamChunk contains SSE-formatted data:
data: {"id":"portkey-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"portkey-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"portkey-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":2,"total_tokens":12}}
data: [DONE]
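Because each chunk carries raw SSE bytes, clients typically strip the data: prefix and JSON-decode each event. A minimal Python sketch, reusing the stub, metadata, and stream_request from the earlier example:
import json

def iter_events(stream):
    # Each StreamChunk.data may hold one or more SSE lines.
    for chunk in stream:
        for line in chunk.data.decode('utf-8').splitlines():
            if not line.startswith('data: '):
                continue  # skip blank separator lines
            payload = line[len('data: '):]
            if payload == '[DONE]':
                return  # end-of-stream sentinel
            yield json.loads(payload)

for event in iter_events(stub.ChatCompletionsStream(stream_request, metadata=metadata)):
    for choice in event.get('choices', []):
        print(choice.get('delta', {}).get('content', ''), end='', flush=True)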
Messages (Anthropic)
Returns the Anthropic Messages format — even when the underlying provider is not Anthropic (e.g., calling Messages with @openai/gpt-4o):
{
"id": "msg_xxx",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "The capital of France is Paris."
}
],
"model": "claude-3-opus-20240229",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 14,
"output_tokens": 10
}
}
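For example, the same Messages call routed to an OpenAI model still returns the Anthropic-shaped response above:
grpcurl -plaintext \
  -H 'x-portkey-api-key: YOUR_PORTKEY_KEY' \
  -d '{
    "input": "{\"model\": \"@openai/gpt-4o\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"max_tokens\": 100}"
  }' \
  localhost:8789 gateway.Gateway/Messages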
Responses (OpenAI Responses API)
Returns the OpenAI Responses API format:
{
"id": "resp_xxx",
"object": "response",
"created_at": 1234567890,
"model": "gpt-4o",
"output": [
{
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "Why did the chicken cross the road? To get to the other side!"
}
]
}
],
"usage": {
"input_tokens": 8,
"output_tokens": 18,
"total_tokens": 26
}
}
HTTP response headers are returned as gRPC trailing metadata:
x-portkey-trace-id: xxx-xxx-xxx
x-portkey-provider: google
x-portkey-cache-status: DISABLED
x-portkey-retry-attempt-count: 0
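To read these trailers from a unary call in Python, use with_call, which returns the call object alongside the response; a minimal sketch using the stub and request from the earlier example:
# with_call returns (response, call); the call object exposes the trailers.
response, call = stub.ChatCompletions.with_call(request, metadata=metadata)
for key, value in call.trailing_metadata():
    print(f'{key}: {value}')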
Error Handling
Provider and request errors are returned inside the GatewayResponse.status_code field, not as gRPC status codes on the wire. The gRPC call itself will return OK unless there is a gateway infrastructure failure (e.g., server crash, proto parse error).
Use the table below to interpret the status_code value in the response:
| Status Code | Equivalent gRPC Status | Description |
|---|---|---|
| 200 | OK | Success |
| 400 | INVALID_ARGUMENT | Bad request |
| 401 | UNAUTHENTICATED | Invalid API key |
| 403 | PERMISSION_DENIED | Access denied |
| 404 | NOT_FOUND | Resource not found |
| 429 | RESOURCE_EXHAUSTED | Rate limited |
| 500 | INTERNAL | Server error |
| 503 | UNAVAILABLE | Service unavailable |
| 504 | DEADLINE_EXCEEDED | Timeout |
Always check GatewayResponse.status_code to detect errors — do not rely on the gRPC call status alone.
Error responses include details in the body:
{
"error": {
"message": "Invalid API key",
"type": "authentication_error",
"code": "invalid_api_key"
}
}
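A minimal client-side check, assuming the Python stub and request from the earlier example:
import json

response = stub.ChatCompletions(request, metadata=metadata)
if response.status_code != 200:
    # Provider and request errors arrive in the body, not as gRPC status codes.
    error = json.loads(response.body.decode('utf-8'))['error']
    raise RuntimeError(f"{response.status_code}: {error['message']} ({error['type']})")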
When to Use gRPC vs HTTP
| Use gRPC when… | Use HTTP when… |
|---|---|
| You need the lowest possible latency | You have browser-based clients (gRPC-Web requires a proxy) |
| You have high-throughput streaming workloads | You need simple integrations with widely supported REST |
| You’re in a service-to-service architecture with protobuf | You need easy debugging with standard HTTP tools |
| You want native gRPC connections to providers like Google Gemini | |
Connection Management
The gateway maintains a cache of gRPC client connections:
- Connections are reused per API key for efficiency
- Stale connections are automatically cleaned up
- Default timeout: 60 seconds (300 seconds for streaming)
Limitations
The following limitations apply during the beta period:
- WebSocket-based realtime endpoints (/v1/realtime) are not available via gRPC
- File upload/download operations use HTTP proxy mode only
- Batch operations use HTTP proxy mode only
- Only server-side streaming is supported (no bidirectional streaming)
Troubleshooting
Connection Refused
Error: 14 UNAVAILABLE: Connection refused
Ensure the gRPC server is running with the --llm-grpc flag:
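npm start -- --llm-grpc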
Authentication Errors
Error: 16 UNAUTHENTICATED
Verify your Portkey API key is correctly passed as gRPC metadata:
grpcurl -H 'x-portkey-api-key: YOUR_KEY' ...
INVALID_ARGUMENT Errors
Error: 3 INVALID_ARGUMENT
Check that your request JSON is valid and properly escaped in the input field. Ensure the model string follows the @provider_slug/model_name format.
Timeout Issues
For long-running requests, increase the client-side timeout:
Python (grpcio)
response = stub.ChatCompletions(
    request,
    metadata=metadata,
    timeout=120.0  # 120 seconds
)
Node.js (@grpc/grpc-js)
client.ChatCompletions(
    request,
    metadata,
    { deadline: Date.now() + 120000 },
    callback
);