Building an LLM-as-a-Judge System for an AI Customer Support Agent
Before reading this guide: We recommend checking out Hamel Husain’s excellent post on LLM-as-a-Judge. This cookbook implements the principles discussed in Hamel’s post, providing a practical walkthrough of building an LLM-as-a-judge evaluation system.
Introduction
AI-powered customer support agents are great, but how do you ensure they provide high-quality responses at scale?
You need a system that can automatically evaluate customer support interactions by analyzing both the customer’s query and the AI agent’s response. This system should determine whether the response meets quality standards, provide a detailed critique explaining the reasoning behind the judgment, and scale easily to run tests on thousands of interactions.
Quality assurance for customer support interactions is critical but increasingly challenging as AI agents handle more customer conversations. Manual reviews are thorough, but they don’t scale.
The “LLM-as-a-Judge” approach offers a powerful solution to this challenge. This guide will show you how to build an automated evaluation system that scales to thousands of interactions. By the end, you’ll have a robust workflow that helps you improve your AI agent’s responses.
What We’re Building
We’ll create an LLM as a judge workflow that evaluates customer support interactions by analyzing both the customer’s query and the AI agent’s response. For each interaction, our system will:
- Determine whether the response meets quality standards (pass/fail)
- Provide a detailed critique explaining the reasoning behind the judgment
- Scale easily to run tests on thousands of interactions
Use Case Example:
Imagine you’re building a customer support AI agent. Your challenges include:
- Ensuring consistent quality across all AI responses
- Identifying patterns of problematic responses
- Maintaining security and compliance standards
- Quickly detecting when the AI is providing incorrect information
With an LLM-as-a-Judge system, you can:
- Get specific feedback on why responses fail to meet standards
- Identify trends and systematic issues in your support system
- Provide targeted training and improvements based on detailed critiques
- Quickly validate whether changes to your AI agent have improved response quality
System Architecture: How It Works
Industry Best Practices for AI Agent Evaluation
Before diving into implementation, let’s briefly look at evaluation approaches for customer support AI:
- Human Evaluation: The gold standard, but doesn’t scale
- Offline Benchmarking: Testing against curated datasets with known answers
- Online Evaluation: Monitoring live interactions and collecting user feedback
- Multi-dimensional Scoring: Evaluating across different attributes (accuracy, helpfulness, tone)
- LLM-as-a-Judge: Using a powerful model to simulate expert human judgment
This cookbook focuses on building a robust LLM-as-a-Judge system that balances accuracy with scalability, allowing you to evaluate thousands of customer interactions automatically.
Working with Portkey’s Prompt Studio
We will be using Prompt Studio in this cookbook. Unlike traditional approaches where prompts are written directly in code, Portkey allows you to:
- Create and manage prompts through an intuitive UI
- Version control your prompts
- Access prompts via simple API calls
- Deploy prompts to different environments
We use Mustache templating ({{variable}}) in our prompts, which allows for dynamic content insertion. This makes our prompts more flexible and reusable.
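For instance, a template snippet with placeholders might look like the following (the variable names here are illustrative, not required by Portkey):

```
Customer query: {{customer_query}}
Agent response: {{agent_response}}
```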
What are Prompt Partials?
Prompt partials are reusable components in Portkey that let you modularize parts of your prompts. Think of them like building blocks that can be combined to create complex prompts. In this guide, we’ll create several partials (company info, guidelines, examples) and then combine them in a main prompt.
To follow this guide, you will need to create prompt partials first, then create the main template in the Portkey UI, and finally access them using the prompt_id inside your codebase.
Step-by-Step Guide to Building LLM-as-a-Judge
The Judge Prompt Structure
To build an effective LLM judge, we need to create a well-structured prompt that gives the model all the context it needs to make accurate evaluations. Our judge prompt will consist of four main components:
- Company Information - Details about your company, products, and support policies that help the judge understand the context of customer interactions
- Evaluation Guidelines - Specific criteria for what makes a good or bad response in your customer support context
- Golden Examples - Sample evaluations that demonstrate how to apply the guidelines to real interactions
- Main Judge Template - Brings the other components together into the complete judge prompt
Step 1: Define Your Company Information in a Partial
First, we’ll create a partial that provides context about your company, products, and support policies. This helps the judge evaluate responses in the proper context.
Here’s an example of what your company info partial might look like:
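The sketch below is purely illustrative: the company name, products, and policies are invented placeholders that you would replace with your own facts.

```
## Company Information

Company: Acme Electronics (hypothetical example)

Products:
- SmartHome Hub, wireless security cameras, smart thermostats

Return policy:
- 30-day returns on unopened items; defective items replaced within 90 days

Shipping:
- Standard shipping takes 3-5 business days; free shipping on orders over $50

Support channels:
- Email and live chat (9am-6pm EST); phone support for Premium customers

Special programs:
- "Acme Care" extended warranty; loyalty program with annual discounts
```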
This partial gives the judge important context about your products, return policy, shipping policies, support channels, and special programs. Customize this to match your own company’s specifics.
After creating this partial in Portkey, you’ll get a partial ID (e.g., pl-llm-as-0badba) that you’ll reference in your main prompt template.
Step 2: Define the Evaluation Guidelines Partial
Next, create a partial that defines the criteria for evaluating responses. This ensures consistent quality standards.
Here’s an example of evaluation guidelines:
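A hedged example of what such guidelines could contain follows; the specific criteria and wording are illustrative, not prescriptive.

```
## Evaluation Guidelines

Primary criteria (a response FAILS if any of these are violated):
1. Factual accuracy - policies, prices, and product details must match the company information
2. Completeness - the response must address every part of the customer's question
3. Safety and compliance - no requests for unnecessary personal data, no unauthorized promises

Secondary considerations:
- Tone: professional, empathetic, and concise
- Actionability: gives the customer a clear next step

Good responses are accurate, complete, grounded in company policy, and appropriately empathetic.
Bad responses invent policies, ignore part of the question, or promise outcomes the company cannot guarantee.

Critique format:
- 2-4 sentences explaining the judgment, citing the specific guideline(s) involved
```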
These guidelines define your primary evaluation criteria, secondary considerations, what constitutes good vs. bad responses, and format requirements for critiques. You can adjust these based on what matters most for your specific customer support context.
After creating this partial, you’ll receive another partial ID (e.g., pl-llm-as-1e1952) to reference in your main template.
Step 3: Create Golden Examples Partial
Now create a partial with example evaluations. These examples “teach” the LLM what good and bad responses look like in your specific context.
Here’s what your examples partial might look like:
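Here is a sketch of an examples partial, reusing the hypothetical company details from Step 1; the queries, responses, verdicts, and critiques are all invented for illustration.

```
## Golden Examples

Example 1 (PASS)
Customer query: "Can I return the SmartHome Hub I bought two weeks ago?"
Agent response: "Yes - unopened items can be returned within 30 days of purchase. I can email you a prepaid return label right away."
Verdict: PASS
Critique: The response states the return policy accurately, answers the question directly, and offers a clear next step.

Example 2 (FAIL)
Customer query: "My camera stopped recording after the latest update. What should I do?"
Agent response: "Please uninstall and reinstall the app. If that doesn't work, you can buy the newer model at a discount."
Verdict: FAIL
Critique: The response skips standard troubleshooting and the 90-day defect replacement policy, and pushes a purchase instead of resolving the issue, leaving the customer's problem unaddressed.
```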
When creating your examples:
- Include diverse scenarios covering different types of customer questions
- Show the reasoning process by explaining why an answer is good or bad
- Include both good and bad examples
- Match your actual use cases with examples that reflect your real customer interactions
- Be consistent with the format structure
After creating this partial, you’ll receive another partial ID (e.g., pl-exampl-55b6e3) to reference in your main template.
Step 4: Create the Main Judge Prompt Template
Now that you have all the partials, it’s time to create the main judge prompt template that brings everything together. We will reference the partials we created earlier to provide context, guidelines, and examples to the judge using mustache variables.
Here’s what your main prompt template should look like:
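One possible shape for the main template is sketched below. The partial-inclusion syntax and the variable names are assumptions; use whatever reference syntax the Portkey UI generates for your partial IDs.

```
You are an expert customer support quality evaluator.

Company context:
{{>pl-llm-as-0badba}}

Evaluation guidelines:
{{>pl-llm-as-1e1952}}

Golden examples:
{{>pl-exampl-55b6e3}}

Evaluate the following interaction against the guidelines above.

Customer query:
{{customer_query}}

Agent response:
{{agent_response}}

Return a JSON object with exactly two fields:
- "verdict": "PASS" or "FAIL"
- "critique": a 2-4 sentence explanation of the judgment
```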
This template sets the evaluator role, inserts your company information, guidelines, and examples, and provides placeholders for the customer query and agent response. Make sure to select an appropriate model (like OpenAI o1, DeepSeek R1) when creating this template.
Once you’ve created the main template, you’ll get a prompt ID that you’ll use in your code to access this prompt.
Step 5: Implementing the Evaluation Code with Structured Output
Now that you have your prompt template set up in Portkey, use this Python code to evaluate customer support interactions with structured output:
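A minimal sketch of that evaluation step is shown below. The prompt ID is a placeholder for the one generated in Step 4, the variable names follow the template sketch above, and instead of a provider-specific structured-output feature it simply parses the JSON object the judge template asks for.

```python
# Sketch only: prompt ID, variable names, and output fields are assumptions
# based on the template sketch above -- substitute your own Portkey IDs.
import json

from portkey_ai import Portkey

portkey = Portkey(api_key="YOUR_PORTKEY_API_KEY")

JUDGE_PROMPT_ID = "pp-judge-xxxxx"  # hypothetical: the prompt ID from Step 4


def evaluate_interaction(customer_query: str, agent_response: str) -> dict:
    """Run the LLM judge on a single customer support interaction."""
    completion = portkey.prompts.completions.create(
        prompt_id=JUDGE_PROMPT_ID,
        variables={
            "customer_query": customer_query,
            "agent_response": agent_response,
        },
    )

    # Assumes the judge returns the JSON object the template asks for:
    # {"verdict": "PASS"|"FAIL", "critique": "..."}
    raw = completion.choices[0].message.content
    result = json.loads(raw)

    return {
        "verdict": result["verdict"],    # "PASS" or "FAIL"
        "critique": result["critique"],  # reasoning behind the judgment
    }


if __name__ == "__main__":
    evaluation = evaluate_interaction(
        customer_query="Can I return an unopened SmartHome Hub after two weeks?",
        agent_response="Yes, unopened items can be returned within 30 days.",
    )
    print(evaluation)
```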
Step 6: Iterate with Domain Experts
The most important part of building an effective LLM-as-a-Judge is iterating on your prompt with feedback from domain experts:
- Create a small test dataset with 20-30 representative customer support interactions
- Have human experts evaluate these interactions using the same criteria
- Compare the LLM judge results with human expert evaluations
- Calculate agreement rate and identify patterns in disagreements
- Update your prompt based on what you learn
Focus especially on adding examples that cover edge cases where the judge disagreed with experts, clarifying evaluation criteria, and adjusting the weight given to different factors based on business priorities.
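For the agreement-rate comparison described above, a minimal sketch might look like this (the label lists are illustrative; adapt them to however you store your test set and expert annotations):

```python
# Compare LLM judge verdicts against human expert labels on the same
# interactions, then surface the disagreements worth inspecting.
human_labels = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]  # expert judgments
judge_labels = ["PASS", "FAIL", "FAIL", "PASS", "FAIL"]  # LLM judge verdicts

matches = sum(h == j for h, j in zip(human_labels, judge_labels))
agreement_rate = matches / len(human_labels)
print(f"Agreement rate: {agreement_rate:.0%}")

# Indices where judge and experts disagree -- look for patterns here and
# turn them into new golden examples or clearer guidelines.
disagreements = [
    i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
]
print(f"Disagreements at indices: {disagreements}")
```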
Portkey Observability for Continuous Improvement
One of Portkey’s key advantages is its built-in observability. Each evaluation generates detailed traces showing execution time and token usage, input and output logs for debugging, and performance metrics across evaluations.
This visibility helps you identify performance bottlenecks, track costs as you scale, debug problematic evaluations, and compare different judge prompt versions.
Visualizing Evaluation Results on the Portkey Dashboard
The feedback data we collect using the portkey.feedback.create() method automatically appears in the Portkey dashboard, allowing you to:
- Track evaluation outcomes over time
- Identify specific areas where your agent consistently struggles
- Measure improvement after making changes to your AI agent
- Share results with stakeholders through customizable reports
The dashboard gives you a bird’s-eye view of your evaluation metrics, making it easy to spot trends and areas for improvement.
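As a hedged sketch of how that feedback might be recorded: the snippet below assumes you attach a trace ID of your own to each evaluation request and then log the verdict against that trace; exactly how you attach the trace ID depends on your setup.

```python
import uuid

from portkey_ai import Portkey

portkey = Portkey(api_key="YOUR_PORTKEY_API_KEY")

# Illustrative: generate a trace ID and attach it to the judge request so
# the feedback below lands on the right trace in the Portkey dashboard.
trace_id = str(uuid.uuid4())

# ... run evaluate_interaction(...) with this trace_id attached ...
verdict = "PASS"  # placeholder for the judge's verdict on that interaction

# Record the outcome as feedback so it surfaces in the dashboard views.
portkey.feedback.create(
    trace_id=trace_id,
    value=1 if verdict == "PASS" else 0,
)
```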
Running Evaluation at Scale
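The sketch below reuses the evaluate_interaction() helper from Step 5; the dataset format is an assumption, and in practice you would load interactions from a CSV, database, or log export rather than hard-coding them.

```python
# Batch-evaluate a dataset of customer support interactions and report
# the overall pass rate plus the critiques for failing responses.
interactions = [
    {
        "customer_query": "Can I return an opened smart thermostat?",
        "agent_response": "Opened items can't be returned, but defective units are replaced within 90 days.",
    },
    {
        "customer_query": "How long does standard shipping take?",
        "agent_response": "Standard shipping takes 3-5 business days.",
    },
    # ... the rest of your dataset
]

results = []
for interaction in interactions:
    evaluation = evaluate_interaction(
        customer_query=interaction["customer_query"],
        agent_response=interaction["agent_response"],
    )
    results.append({**interaction, **evaluation})

passes = sum(1 for r in results if r["verdict"] == "PASS")
pass_rate = passes / len(results)
print(f"Pass rate: {pass_rate:.0%} ({passes}/{len(results)})")

# Failing interactions and their critiques are the most actionable output.
for r in results:
    if r["verdict"] == "FAIL":
        print(f"- {r['customer_query']}\n  Critique: {r['critique']}")
```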
This code runs your evaluator on an entire dataset, collects the results, and calculates an overall pass rate.
Next Steps
After implementing your LLM-as-a-Judge system, here are key ways to leverage it:
- Analyze quality trends: Track pass rates over time to measure improvement
- Identify systematic issues: Look for patterns in failing responses to address root causes
- Improve your support AI: Use the detailed critiques to refine your support system
Conclusion
An LLM-as-a-Judge system transforms how you approach customer support quality assurance. Rather than sampling a tiny fraction of interactions or relying on vague metrics, you can evaluate every interaction with consistency and depth. The detailed critiques provide actionable insights that drive continuous improvement in your customer support AI.
By implementing this approach with Portkey, you create a scalable quality assurance system that grows with your support operations while maintaining the high standards your customers expect.
Ready to build your own LLM-as-a-Judge system? Get started with Portkey today.