Simplifying LLM batch inference

LLM batch inference promises lower costs and fewer rate limits, but providers make it complex. See how Portkey simplifies batching with a unified API, direct outputs, and transparent pricing.

AI teams often need to process tens of thousands or even millions of requests at once. Running each request individually in real time isn’t practical when the workload is offline, non-urgent, or simply too large. This is where batching comes in.

Batching allows developers to group requests into a single job that runs asynchronously in the background. Providers like OpenAI, Azure, Bedrock, and Vertex support batching for exactly this reason: it’s more efficient for them to process bulk jobs, and they pass those efficiencies on to customers.
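To make this concrete, here is a minimal sketch of preparing a batch input file in the JSONL format that OpenAI's Batch API expects: each line is a self-contained request tagged with a custom_id so outputs can be matched back to inputs later. The model name and prompts below are purely illustrative.

```python
# A minimal sketch: build a batch input file in OpenAI's Batch API JSONL format.
# Each line is one request with a custom_id used to match outputs back to inputs.
import json

documents = ["First document to summarize...", "Second document to summarize..."]

with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # illustrative model choice
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
                "max_tokens": 256,
            },
        }
        f.write(json.dumps(request) + "\n")
```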

The two biggest advantages of batching are clear:

  • Lower costs: Providers typically offer steep discounts (up to 50%) on batched jobs compared to real-time requests.
  • Bypassing rate limits: Batch workloads don’t count against the strict per-second limits applied to inference APIs, freeing up quota for live applications.

For use cases like analytics, embeddings, or evaluations, where results can wait a few hours, batching is the natural choice. But while the concept is simple, the execution today is anything but.

Challenges with batching today

While batching promises efficiency, actually running batches directly with providers is a frustrating experience. Developers run into the same hurdles every time:

  1. Provider-specific file uploads
    Each provider has its own rules for batching. Teams have to reformat data and manage uploads separately for each provider.
  2. Continuous monitoring
    Once a batch is submitted, there’s no simple “set it and forget it.” Developers must monitor the provider’s APIs to check whether the batch has started, is still in progress, or has completed—often for jobs that can take 12–24 hours (the sketch below shows what this polling looks like against a provider directly).
  3. Getting outputs is a hassle
    Even when batches complete, retrieving results isn’t straightforward. Providers return outputs in their own formats, often requiring developers to track file IDs, download results manually, and map them back to original inputs.
  4. Complexity with different providers
    Many providers don’t even offer file upload APIs. Instead, users must upload data to a separate resource (e.g., an S3 bucket) and pass the reference. In large organizations, granting that access is a governance and security hassle.
  5. Opaque pricing
    Providers advertise discounts (like 50% off for batches), but don’t show exact cost or token usage per batch. Teams often run jobs without knowing the final spend until later, making it hard to track budgets or justify usage.

What should be a straightforward way to run large-scale workloads ends up being a patchwork of provider-specific workarounds, manual monitoring, and governance risks.
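To make the friction concrete, here is roughly what a single batch run looks like when going directly to OpenAI with its Python SDK. Every other provider needs its own variant of this workflow, and the snippet below is a simplified sketch rather than production code.

```python
# A simplified sketch of running one batch directly against OpenAI.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: upload the JSONL file built earlier.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Step 2: create the batch job.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Step 3: poll until the job finishes -- this loop can run for 12-24 hours.
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(300)  # check every 5 minutes

# Step 4: download the output file and map rows back to inputs via custom_id.
output = client.files.content(batch.output_file_id)
with open("results.jsonl", "wb") as f:
    f.write(output.read())
```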

How Portkey’s AI gateway simplifies batching

Instead of juggling provider-specific quirks and manual steps, teams using Portkey’s AI gateway get a unified, automated workflow for handling batches from start to finish.

  1. Unified file handling
    Upload your data once in a standard OpenAI-compatible format. Portkey automatically transforms it into the right format and uploads it to the target provider, whether that’s OpenAI, Azure, Bedrock, or Vertex (see the sketch after this list). And if you’re batching for evals, you can reuse the same file across multiple providers.
  2. Automatic monitoring
    Portkey tracks the progress of your batch jobs in the background. You don’t need to keep polling provider APIs or manually checking statuses; everything is monitored for you.
  3. Direct batch output API
    Instead of chasing file IDs or downloading from a separate resource like S3/GCS, you can call Portkey’s batch output endpoint to fetch results directly (see the fetch sketch below). Outputs are normalized into a consistent OpenAI-style format, no matter which provider ran the job.
  4. Cost observability
    Portkey removes the pricing black box. Every batch request is logged with accurate cost details (including provider discounts) in the Portkey dashboard, so teams can track usage just as easily as they do with inference calls.
  5. Governance and security
    By handling uploads and outputs centrally, Portkey eliminates the need to expose raw S3 or GCS buckets to every user. Enterprises can batch at scale without introducing new access risks or governance gaps.
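As a rough sketch of the unified upload-and-create step (item 1 above), assuming Portkey’s OpenAI-compatible Python SDK (portkey-ai) and a virtual key already configured for the target provider; the parameter names mirror the OpenAI Batch API shape, so check the Portkey docs for provider-specific options.

```python
# A rough sketch, assuming the portkey-ai SDK and a pre-configured virtual key.
from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",        # placeholder
    virtual_key="AZURE_VIRTUAL_KEY",  # swap to an OpenAI/Bedrock/Vertex key without changing code
)

# One OpenAI-style JSONL file; Portkey handles any provider-specific reformatting.
batch_file = portkey.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = portkey.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```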

With Portkey, batching becomes as simple as create → monitor → fetch, instead of a multi-step workflow that varies by provider.
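The monitor and fetch half of that flow might look like the sketch below. The /batches/{id}/output route reflects the direct batch output endpoint described above, but the exact URL, header names, and the placeholder batch ID are assumptions to confirm against Portkey’s API reference.

```python
# A hedged sketch of checking status and pulling normalized results from Portkey.
import requests
from portkey_ai import Portkey

PORTKEY_API_KEY = "PORTKEY_API_KEY"  # placeholder
BATCH_ID = "batch_abc123"            # illustrative ID from the create step above

portkey = Portkey(api_key=PORTKEY_API_KEY, virtual_key="AZURE_VIRTUAL_KEY")

# One status check -- Portkey tracks progress in the background, so no polling loop.
print(portkey.batches.retrieve(BATCH_ID).status)

# Fetch results directly; no provider file IDs or S3/GCS downloads.
resp = requests.get(
    f"https://api.portkey.ai/v1/batches/{BATCH_ID}/output",
    headers={"x-portkey-api-key": PORTKEY_API_KEY},
)
resp.raise_for_status()
for line in resp.text.splitlines():  # one OpenAI-style result object per line
    print(line)
```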

Advanced batching with Portkey

Beyond simplifying provider batches, Portkey adds capabilities that go further than what providers offer natively. These make batching flexible enough to fit both asynchronous and near real-time workloads.

  1. Immediate batches
    Instead of waiting 12–24 hours for provider-side batches, Portkey can process large jobs in “immediate” mode: requests are split and executed in parallel through the Portkey Gateway, then returned in the same batch-style format. You can even distribute batch traffic across multiple providers using configs (for example, sending 30% to Azure, 40% to OpenAI, and 30% to Bedrock), which reduces rate-limit errors while retaining full flexibility in model selection; see the config sketch at the end of this list.
  • Useful for embeddings, RAG pipelines, or evaluations where results are needed quickly.
  • Supports per-request model selection, something providers don’t allow in their native batch APIs.
  • Automatically retries failed requests and spreads workload over time to reduce rate-limit errors.

With Portkey’s model catalog, you can also restrict access to specific models, making it easy to govern usage.

  2. Eval-style workflows
    With immediate batches, Portkey can power large-scale evaluation runs out of the box. Upload a file of prompts, run thousands of completions, and receive outputs in a structured, normalized format, without building a separate eval system.
  3. Retries and workload distribution
    With Portkey, you can add retry configurations to your batches to control how workloads are distributed and to avoid overwhelming providers with rate-limit errors; this removes the need for teams to build custom retry logic or pacing into their own pipelines. The config sketch below includes a retry block.
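Putting these pieces together, a config-driven immediate batch might look like the sketch below. The virtual key names are placeholders, the weights mirror the 30/40/30 example above, the retry block follows Portkey’s config schema, and the “immediate” completion window stands in for Portkey’s gateway-managed mode; confirm the exact values against the Portkey docs.

```python
# A sketch of an immediate batch distributed across providers, with retries.
from portkey_ai import Portkey

# Load-balance 30/40/30 across providers and retry rate-limited requests.
config = {
    "retry": {"attempts": 3, "on_status_codes": [429]},
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {"virtual_key": "azure-virtual-key", "weight": 0.3},
        {"virtual_key": "openai-virtual-key", "weight": 0.4},
        {"virtual_key": "bedrock-virtual-key", "weight": 0.3},
    ],
}

portkey = Portkey(api_key="PORTKEY_API_KEY", config=config)

batch = portkey.batches.create(
    input_file_id="file_abc123",       # the OpenAI-style JSONL uploaded earlier (illustrative ID)
    endpoint="/v1/chat/completions",
    completion_window="immediate",     # gateway-managed mode instead of a provider-side batch
)
print(batch.id)
```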

Closing: a unified workflow for batching

Batching is complex—different file formats, manual monitoring, access issues to storage resources, no visibility into usage, and unclear pricing. Portkey takes that complexity away with a unified, observable, and governance-friendly batching workflow that works across providers and scales with your needs.

If your team is dealing with high-volume AI workloads, book a demo with us to see how Portkey can simplify batching in your environment.