How Snorkel AI debugs multi-agent evaluations at scale with Portkey

A single LLM call is easy to debug: check the prompt, check the response. But when agents call sub-agents that spawn tools that branch on intermediate results, failures become invisible. Snorkel found a way to see the whole tree.

About

Snorkel AI provides expert data annotation for AI research, partnering with foundation model labs like Anthropic, Google, and Amazon on post-training and evaluation tasks.

Industry

AI Data Research

Headquarters

North America

Why Portkey:

AI governance, EU-first routing, budget enforcement, multi-provider, access and cost control

20%
increase in eval accuracy
2x
faster debugging
The agent debugging black box

At Snorkel AI, multi-agent systems power evaluation workflows that review expert-generated labels for specialized domains. Their Multi-Agent Question-Answer Validator alone processes thousands of completions per week, verifying annotations that ultimately power evaluation datasets, RLHF pipelines, and reasoning benchmarks.

But when something went wrong, finding where and why felt like archaeology.

The team's debugging workflow was fragmented: raw logs scattered across databases and S3, manual reconstruction of agent execution paths, no visualization of the decision tree, and aggregate-only metrics that obscured issues with individual trajectories.

"Prior to Portkey, Snorkel debugged LLM applications with a fragmented patchwork of logs in S3, DB queries, and manual visualization. The clean Portkey UI allows us to identify, debug, and rebuild agents with confidence."

— Shae Selix, Staff Engineer - TLM, Snorkel AI

Finding the right solution

Multi-agent systems introduce unique debugging challenges. Unlike single LLM calls where you simply check the prompt and response, agent systems involve hierarchical execution, parallel tool invocations, inter-agent state passing, and conditional branching based on intermediate results.

Portkey stood out for its:

  • Visual hierarchy showing every agent call, sub-agent, and tool invocation as a nested tree

  • Individual prompt-response inspection at any node in the execution tree

  • Custom tagging and filtering to identify patterns across traces

  • Side-by-side trace comparison to understand why good and bad cases diverge
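The nested-tree view Portkey renders can be approximated with a simple recursive structure. This is a hypothetical sketch of the idea, not Portkey's actual data model; the `Span` type and agent names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One node in an agent execution trace: an agent, sub-agent, or tool call."""
    name: str                 # e.g. "Planner", "web_search", "Verdict"
    prompt: str = ""
    response: str = ""
    children: list["Span"] = field(default_factory=list)

def render(span: Span, depth: int = 0) -> str:
    """Render the execution tree as an indented hierarchy, one node per line."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)

# A toy trace shaped like the workflow described in this article
trace = Span("Planner", children=[
    Span("web_search"),
    Span("Code Agent", children=[Span("run_python")]),
    Span("Verdict"),
])
print(render(trace))
```

Because every child keeps a pointer back into the tree, inspecting the prompt and response at any node is a field lookup rather than a join across scattered log stores — which is the essence of the debugging improvement described here.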

Making agent execution visible

With Portkey, the team could finally see what was happening inside their agents. Every call, every sub-agent, every tool invocation appeared as a nested tree, no more reconstructing execution paths from scattered logs.

The difference was immediate. In the failing trace, they could see the exact moment things went wrong: the Planner received a chart with no question, decided to search the web for context, and spiraled into 38 calls chasing information that didn't exist. Each step was visible in the hierarchy.

Successful traces told a different story: a clean path from Planner to Code Agent to Verdict, with the right tools firing in the right order. Comparing the two side-by-side revealed exactly where execution diverged.
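Finding the divergence point between a good and a bad trace can be sketched as a walk over the two execution paths. A minimal illustration, assuming traces are flattened to ordered lists of step names (the step labels below are taken from the article's example, abbreviated):

```python
def first_divergence(good: list[str], bad: list[str]):
    """Return (index, good_step, bad_step) at the first point where two
    execution paths differ, or None if they match step-for-step."""
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    if len(good) != len(bad):
        i = min(len(good), len(bad))
        return (i,
                good[i] if i < len(good) else None,
                bad[i] if i < len(bad) else None)
    return None

good = ["Planner", "Code Agent", "Verdict"]
bad = ["Planner", "web_search", "web_search", "web_search"]  # the runaway spiral, abbreviated
print(first_divergence(good, bad))  # -> (1, 'Code Agent', 'web_search')
```

In this toy comparison, the paths split at step 1: the healthy run handed off to the Code Agent, while the failing run began searching the web — precisely the kind of divergence a side-by-side trace view surfaces at a glance.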

The impact

After implementing Portkey for agent debugging, Snorkel saw tangible improvements across their evaluation workflows.

  • 20% accuracy boost in agent evaluations by catching and fixing edge cases that were previously invisible in aggregate metrics.

  • 2x faster issue detection in development: problems surface immediately in traces rather than requiring hours of log archaeology.

  • Faster iteration cycles: What used to take hours of manual reconstruction now takes minutes of visual inspection.

"With Portkey, every agent, sub-agent, and tool call is visible in one hierarchy — you can see how decisions branch and converge. Seeing the full execution tree in one view completely changed how we debug."

— Shae Selix, Staff Engineer - TLM, Snorkel AI

Lessons for teams building production agents

  • Instrument early. Don't wait until you have problems to start tracing. The cost of instrumentation is negligible compared to the regret of looking for log details that never persisted.

  • Tag everything. Metadata is cheap; confusion is expensive. Tag traces with task type, input characteristics, and outcome labels so you can filter and analyze patterns later.

  • Be mindful of context. When tools return large volumes of tokens, as web search does, weigh the tradeoff between adding relevant information and context rot. For Snorkel's verifier agent, extra web context was a distraction that increased hallucination.

  • Visual debugging is non-negotiable. Multi-agent systems are inherently hierarchical and branching. Flat logs are the wrong data structure for understanding them.
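The "tag everything" lesson can be sketched locally. Portkey accepts custom metadata on requests; the filtering idea works the same way on any tagged record. The field names and tag values below are illustrative, not Portkey's API:

```python
# Toy trace store: each trace carries arbitrary metadata tags
traces = [
    {"trace_id": "t1", "metadata": {"task": "chart_qa", "outcome": "pass"}},
    {"trace_id": "t2", "metadata": {"task": "chart_qa", "outcome": "fail"}},
    {"trace_id": "t3", "metadata": {"task": "label_review", "outcome": "pass"}},
]

def filter_traces(traces, **tags):
    """Return traces whose metadata matches every given tag."""
    return [t for t in traces
            if all(t["metadata"].get(k) == v for k, v in tags.items())]

failing_chart_qa = filter_traces(traces, task="chart_qa", outcome="fail")
print([t["trace_id"] for t in failing_chart_qa])  # -> ['t2']
```

Tagging at write time is what makes pattern analysis possible at read time: without the `task` and `outcome` labels, isolating the failing chart-QA runs would again require reconstructing context by hand.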

As AI systems become more agentic, having clear visibility into execution paths will be the difference between confident deployment and constant firefighting.

If you'd like to see what Portkey can do for you, book a demo with us today.
