FutureAGI is an AI lifecycle platform that provides automated evaluation, tracing, and quality assessment for LLM applications. When combined with Portkey, you get a complete end-to-end observability solution covering both operational performance and response quality. 
Portkey handles the “what happened, how fast, and how much did it cost?” while FutureAGI answers “how good was the response?” 
 
Why FutureAGI + Portkey?  
The integration creates a powerful synergy: 
Portkey acts as the operational layer - unifying API calls, managing keys, and monitoring metrics like latency, cost, and request volume 
FutureAGI acts as the quality layer - capturing full request context and running automated evaluations to score model outputs 
 
Getting Started  
Prerequisites  
Before integrating FutureAGI with Portkey, ensure you have: 
Python 3.8+ installed 
API keys: a Portkey API key, plus a FutureAGI API key and secret key (used below as PORTKEY_API_KEY, FI_API_KEY, and FI_SECRET_KEY)
 
Installation  
pip install portkey-ai fi-instrumentation traceai-portkey
 
Setting up Environment Variables  
Create a .env file in your project root: 
# .env  
PORTKEY_API_KEY="your-portkey-api-key"
FI_API_KEY="your-futureagi-api-key"
FI_SECRET_KEY="your-futureagi-secret-key"
 
Integration Guide  
Step 1: Basic Setup  
Import the necessary libraries and configure your environment: 
import asyncio
import json
import time

from portkey_ai import Portkey
from traceai_portkey import PortkeyInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType,
    EvalSpanKind, EvalName, ModelChoices
)
from dotenv import load_dotenv

load_dotenv()
 
Step 2: Configure Evaluation Tags  
Set up comprehensive evaluation tags to automatically assess model responses: 
def setup_tracing(project_version_name: str):
    """Setup tracing with comprehensive evaluation tags"""
    tracer_provider = register(
        project_name="Model-Benchmarking",
        project_type=ProjectType.EXPERIMENT,
        project_version_name=project_version_name,
        eval_tags=[
            # Evaluates if the response is concise
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.IS_CONCISE,
                custom_eval_name="Is_Concise",
                mapping={"input": "llm.output_messages.0.message.content"},
                model=ModelChoices.TURING_LARGE
            ),
            # Evaluates context adherence
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.CONTEXT_ADHERENCE,
                custom_eval_name="Response_Quality",
                mapping={
                    "context": "llm.input_messages.0.message.content",
                    "output": "llm.output_messages.0.message.content",
                },
                model=ModelChoices.TURING_LARGE
            ),
            # Evaluates task completion
            EvalTag(
                type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM,
                eval_name=EvalName.TASK_COMPLETION,
                custom_eval_name="Task_Completion",
                mapping={
                    "input": "llm.input_messages.0.message.content",
                    "output": "llm.output_messages.0.message.content",
                },
                model=ModelChoices.TURING_LARGE
            ),
        ]
    )
    # Instrument the Portkey library
    PortkeyInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider
 
The mapping parameter in EvalTag tells the evaluator where to find the necessary data within the trace. This is crucial for accurate evaluation. 
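For example, the Response_Quality tag above wires the evaluator's expected fields to span attributes recorded by the Portkey instrumentation; the attribute paths below are copied from the code above and point at the first input and output message on each LLM span: 
# The keys are the fields the evaluator expects; the values are paths into the traced span.
mapping = {
    "context": "llm.input_messages.0.message.content",   # first prompt message sent to the model
    "output": "llm.output_messages.0.message.content",   # first message returned by the model
}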
 
Step 3: Define Models and Test Scenarios  
Configure the models you want to test and create test scenarios: 
def get_models():
    """Setup model configurations with their Portkey Virtual Keys"""
    return [
        {
            "name": "GPT-4o",
            "provider": "OpenAI",
            "virtual_key": "openai-virtual-key",
            "model_id": "gpt-4o"
        },
        {
            "name": "Claude-3.7-Sonnet",
            "provider": "Anthropic",
            "virtual_key": "anthropic-virtual-key",
            "model_id": "claude-3-7-sonnet-latest"
        },
        {
            "name": "Llama-3-70b",
            "provider": "Groq",
            "virtual_key": "groq-virtual-key",
            "model_id": "llama3-70b-8192"
        },
    ]

def get_test_scenarios():
    """Returns a dictionary of test scenarios"""
    return {
        "reasoning_logic": "A farmer has 17 sheep. All but 9 die. How many are left?",
        "creative_writing": "Write a 6-word story about a robot who discovers music.",
        "code_generation": "Write a Python function to find the nth Fibonacci number.",
    }
 
Step 4: Execute Tests with Automatic Evaluation  
Run tests on each model while capturing both operational metrics and quality evaluations: 
async def test_model(model_config, prompt):
    """Tests a single model with a single prompt and returns the response"""

    tracer_provider = setup_tracing(model_config["name"])

    print(f"Testing {model_config['name']}...")

    client = Portkey(virtual_key=model_config['virtual_key'])
    start_time = time.time()

    # The Portkey chat completion call is synchronous, so it is invoked without await
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model_config['model_id'],
        max_tokens=1024,
        temperature=0.5
    )
    response_time = time.time() - start_time
    response_text = completion.choices[0].message.content or ""

    print(f"{model_config['name']} responded in {response_time:.2f}s")
    return response_text

async def main():
    """Main execution function to run all tests"""
    models_to_test = get_models()
    scenarios = get_test_scenarios()

    for test_name, prompt in scenarios.items():
        print(f"\n{'=' * 20} SCENARIO: {test_name.upper()} {'=' * 20}")
        print(f"PROMPT: {prompt}")
        print("-" * 60)

        for model in models_to_test:
            await test_model(model, prompt)

        await asyncio.sleep(1)  # Brief pause between scenarios
        PortkeyInstrumentor().uninstrument()

if __name__ == "__main__":
    asyncio.run(main())
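Save the script (the filename benchmark.py below is just an example) and run it: 
python benchmark.py
Each request appears in your Portkey logs, and the corresponding traces are scored automatically in the Model-Benchmarking project on FutureAGI. 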
 
Viewing Results  
After running your tests, you’ll have two powerful dashboards to analyze performance: 
FutureAGI Dashboard - Quality View  
Navigate to the Prototype Tab  in your FutureAGI Dashboard to find your “Model-Benchmarking” project. 
Key features: 
Automated evaluation scores for each model response 
Detailed trace analysis with quality metrics 
Comparison views across different models 
 
Portkey Dashboard - Operational View  
Access your Portkey dashboard to see operational metrics for all API calls: 
Key metrics: 
Unified Logs: Single view of all requests across providers 
Cost Tracking: Automatic cost calculation for every call 
Latency Monitoring: Response time comparisons across models 
Token Usage: Detailed token consumption analytics 
 
Advanced Use Cases  
Complex Agentic Workflows  
The integration supports tracing complex workflows where you chain multiple LLM calls: 
# Example: E-commerce assistant with multiple LLM calls
# classify_intent, search_products, and generate_response are application-specific helpers (not shown)
async def ecommerce_assistant_workflow(user_query):
    # Step 1: Intent classification
    intent = await classify_intent(user_query)

    # Step 2: Product search
    products = await search_products(intent)

    # Step 3: Generate response
    response = await generate_response(products, user_query)

    # All steps are automatically traced and evaluated
    return response
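The helpers above are placeholders for your own logic. As a rough sketch (the prompt, label set, and virtual key below are assumptions for illustration, not part of either API), an LLM-backed step such as classify_intent can reuse the instrumented Portkey client so it shows up as its own LLM span in the trace: 
# Illustrative sketch of one workflow step; the virtual key, model, and labels are placeholders.
async def classify_intent(user_query):
    client = Portkey(virtual_key="openai-virtual-key")
    completion = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": f"Classify this query as 'search', 'support', or 'other': {user_query}"
        }],
        model="gpt-4o",
        max_tokens=10,
        temperature=0
    )
    # Any call made through the instrumented Portkey client is captured as an LLM span
    return (completion.choices[0].message.content or "").strip().lower()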
 
CI/CD Integration  
Leverage this integration in your CI/CD pipelines for: 
Automated Model Testing: Run evaluation suites on new model versions 
Quality Gates: Set thresholds for evaluation scores before deployment (see the sketch below) 
Performance Monitoring: Track degradation in model quality over time 
Cost Optimization: Monitor and alert on cost spikes 
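A quality gate can be as simple as a script that fails the pipeline when evaluation scores drop below a threshold. The sketch below is a minimal example under the assumption that you export per-evaluation scores (for instance from your evaluation run or the FutureAGI dashboard) into a dictionary; the thresholds and score source are placeholders, not a FutureAGI API: 
# Hypothetical quality gate: exit non-zero if any evaluation score is below its threshold.
# The scores dict is assumed to come from your own export step, not from a built-in API call.
THRESHOLDS = {"Is_Concise": 0.7, "Response_Quality": 0.8, "Task_Completion": 0.8}

def check_quality_gate(scores: dict) -> None:
    failures = {name: s for name, s in scores.items() if s < THRESHOLDS.get(name, 0.0)}
    if failures:
        raise SystemExit(f"Quality gate failed: {failures}")

check_quality_gate({"Is_Concise": 0.82, "Response_Quality": 0.91, "Task_Completion": 0.88})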
 
Benefits  
Comprehensive Observability: Track both operational metrics (cost, latency) and quality metrics (accuracy, relevance) in one place 
Automated Evaluation: No manual evaluation needed - FutureAGI automatically scores responses on multiple dimensions 
Multi-Model Comparison: Easily compare different models side-by-side on the same tasks 
Production Ready: Built-in alerting and monitoring for your production LLM applications 
 
Example Notebooks  
Interactive Colab Notebook: Try out the FutureAGI + Portkey integration with our interactive notebook 
 
Next Steps  
Create your FutureAGI account  
Set up Virtual Keys in Portkey  
Run the example code to see automated evaluation in action 
Customize evaluation tags for your specific use cases 
Integrate into your CI/CD pipeline for continuous model quality monitoring