08 Agent Optimization and Performance Evaluation
Summary
This material covers methods for optimizing and evaluating AI Agents, focusing on practical tools and frameworks. It introduces a hybrid agent architecture using LangChain/LangGraph, demonstrates the
Agent Capability Optimization and Evaluation Objectives
This section outlines the learning goals:
- Agent Effect Evaluation: Understand and apply tools like LangSmith, OpenEvals, and LangFuse.
- CASE: Investment Advisor AI Assistant: A practical example demonstrating a hybrid agent's design and evaluation.
- Agent Automated Testing: Learn to define test sets, create evaluators, and execute evaluation workflows.
CASE: Investment Advisor AI Assistant
This case study demonstrates a Hybrid Intelligent Agent Architecture for wealth management. It balances reactive, immediate responses with deliberative, long-term strategic planning.
Architecture Design
The agent uses a three-layer design:
- Bottom Layer (Reactive): Handles simple, direct queries (e.g., market status, account info). Provides millisecond-level responses based on pre-set rules.
- Middle Layer (Coordination): Assesses query type and priority, dynamically choosing between reactive or deliberative processing. Manages resource allocation.
- Top Layer (Deliberative): Processes complex analytical queries (e.g., portfolio adjustment, financial planning). Involves multi-step, deep thinking, model building, and generating multiple solutions.
Processing Flow
- Step 1: Query Assessment Stage
  - The Coordination Layer evaluates the customer query to determine its type (emergency, informational, analytical) and the appropriate processing mode (reactive or deliberative).
- Step 2A: Reactive Processing Flow
  - For simple queries (e.g., "How is the stock market today?").
  - Characterized by low latency, high response speed, direct data retrieval, and concise output.
- Step 2B: Deliberative Processing Flow
  - For complex analytical queries (e.g., "How to adjust my portfolio for an economic recession?").
  - Involves data collection, deep analysis, and generation of multi-step recommendations.
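The query assessment in Step 1 could be sketched as a small routing function. This is only an illustration: the keyword lists and the function name `assess_query` are hypothetical, and sending emergency queries down the reactive path is an assumption, not something the case study states. A real coordination layer would more likely use an LLM classifier.

```python
from typing import Literal, Tuple

QueryType = Literal["emergency", "informational", "analytical"]
ProcessingMode = Literal["reactive", "deliberative"]

# Hypothetical keyword lists, for illustration only.
EMERGENCY_KEYWORDS = ("stolen", "fraud", "unauthorized", "margin call")
ANALYTICAL_KEYWORDS = ("adjust", "portfolio", "plan", "recession", "rebalance", "strategy")


def assess_query(user_query: str) -> Tuple[QueryType, ProcessingMode]:
    """Classify a query and choose a processing mode (Step 1)."""
    text = user_query.lower()
    # Assumption: emergencies need an immediate answer, so they go reactive.
    if any(word in text for word in EMERGENCY_KEYWORDS):
        return "emergency", "reactive"
    # Analytical questions go to the deliberative flow (Step 2B).
    if any(word in text for word in ANALYTICAL_KEYWORDS):
        return "analytical", "deliberative"
    # Everything else is treated as a simple informational query (Step 2A).
    return "informational", "reactive"


if __name__ == "__main__":
    for query in (
        "How is the stock market today?",
        "How to adjust my portfolio for an economic recession?",
    ):
        print(query, "->", assess_query(query))
```

Checking emergencies first keeps urgent queries from being delayed by a long analysis when a message matches both keyword lists.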
State Management
The agent's state is maintained using a WealthAdvisorState (a TypedDict) to track user query, customer profile, query type, processing mode, intermediate results (market data, analysis), final response, current phase, and errors.
from typing import Optional, Dict, Any, Literal, TypedDict
class WealthAdvisorState(TypedDict):
    user_query: str
    customer_profile: Optional[Dict[str, Any]]
    query_type: Optional[Literal["emergency", "informational", "analytical"]]
    processing_mode: Optional[Literal["reactive", "deliberative"]]
    emergency_response: Optional[Dict[str, Any]]
    market_data: Optional[Dict[str, Any]]
    analysis_results: Optional[Dict[str, Any]]
    final_response: Optional[str]
    current_phase: Literal["assess", "reactive", "collect_data", "analyze", "recommend", "respond"]
    error: Optional[str]
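As a rough sketch of how this state might be used, the helpers below build an initial state and pick the phase that follows assessment, much as a conditional edge in a graph would. The names `make_initial_state` and `next_phase` are hypothetical, a plain dict stands in for `WealthAdvisorState` to keep the example self-contained, and sending errors straight to the `respond` phase is an assumption.

```python
from typing import Any, Dict, Optional


def make_initial_state(
    user_query: str, customer_profile: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a state dict with every WealthAdvisorState field populated."""
    return {
        "user_query": user_query,
        "customer_profile": customer_profile,
        "query_type": None,
        "processing_mode": None,
        "emergency_response": None,
        "market_data": None,
        "analysis_results": None,
        "final_response": None,
        "current_phase": "assess",  # every run starts with query assessment
        "error": None,
    }


def next_phase(state: Dict[str, Any]) -> str:
    """Choose the phase that follows assessment."""
    if state.get("error"):
        # Assumption: on error, skip processing and report back to the user.
        return "respond"
    if state.get("processing_mode") == "deliberative":
        return "collect_data"
    # Reactive mode, or no mode set yet, falls back to the fast path.
    return "reactive"
```

Starting every field at `None` means later nodes can test for missing intermediate results without catching `KeyError`.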
LangSmith Usage
LangSmith provides a comprehensive toolchain for LLM applications:
- Debugging & Tracing: Real-time tracking of LLM calls, tool usage, and Agent decisions.
- Performance Monitoring: Monitors response time, token usage, costs, etc.
- Testing & Evaluation: Create test datasets to evaluate model output quality.
- Data Analysis: Analyze user query patterns, error rates, success rates.
Setup and Configuration
- Get API Key: From https://smith.langchain.com.
- Set Environment Variables:
    LANGSMITH_API_KEY=your-api-key-here
    LANGCHAIN_TRACING_V2=true
    LANGCHAIN_PROJECT="wealth-advisor-hybrid-agent"  # Optional
- Automatic Tracing with RunnableConfig: RunnableConfig is used to add tags and metadata to runs for filtering, grouping, and troubleshooting in the LangSmith UI.
    from langchain_core.runnables import RunnableConfig
    from datetime import datetime
    # Prepare LangSmith configuration
    customer_id = "customer1"  # Example
    user_query = "Today's market performance?"  # Example
    customer_profile = {"risk_tolerance": "balanced"}  # Example
    config = RunnableConfig(
        tags=[
            "wealth-advisor",
            "hybrid-agent",
            f"customer-{customer_id}",
            customer_profile.get("risk_tolerance", "unknown"),
        ],
        metadata={
            "customer_id": customer_id,
            "risk_tolerance": customer_profile.get("risk_tolerance"),
            "user_query": user_query[:100],
            "timestamp": datetime.now().isoformat(),
        },
        run_name=f"wealth-advisor-{customer_id}-{datetime.now().strftime('%Y%m%d%H%M%S')}",
    )
    # Run the agent (automatic tracing)
    # result = agent.invoke(initial_state, config=config)
Debugging and Visualization
LangSmith provides views to inspect overall inputs/outputs, detailed LLM prompts and responses for each step, and a waterfall visualization to identify performance bottlenecks. Examples can be added to datasets directly from traces for future regression testing and few-shot example generation.
LangSmith Automated Testing
Step 1: Define Test Datasets
Test cases are categorized into reactive, deliberative, and edge cases, each with inputs and expected outputs.
# Example: Reactive Query Test Case
REACTIVE_TEST_CASES = [
    {
        "inputs": {
            # "How is the Shanghai Composite Index performing today?"
            "user_query": "今天上证指数的表现如何?",
            "customer_id": "customer1"
        },
        "expected_outputs": {
            "processing_mode": "reactive",
            # Keywords: "Shanghai Composite Index", "index level", "rise/fall"
            "should_contain": ["上证指数", "点位", "涨跌"]
        }
    },
]
# Example: Deliberative Query Test Case
DELIBERATIVE_TEST_CASES = [
    {
        "inputs": {
            # "Given current market conditions, how should I adjust my portfolio
            # to prepare for a possible economic recession?"
            "user_query": "根据当前市场情况,我应该如何调整投资组合以应对可能的经济衰退?",
            "customer_id": "customer1"
        },
        "expected_outputs": {
            "processing_mode": "deliberative",
            # Keywords: "portfolio", "adjust", "economic recession", "recommendation"
            "should_contain": ["投资组合", "调整", "经济衰退", "建议"]
        }
    },
]
# Example: Edge Case Test Case
EDGE_CASE_TEST_CASES = [
    {
        "inputs": {
            "user_query": "",  # Empty query
            "customer_id": "customer1"
        },
        "expected_outputs": {
            "should_handle_error": True
        }
    },
]
ALL_TEST_CASES = REACTIVE_TEST_CASES + DELIBERATIVE_TEST_CASES + EDGE_CASE_TEST_CASES
Step 2: Create Evaluators
Custom evaluators assess specific aspects of the agent's performance.
- ProcessingModeEvaluator: Checks whether the agent selected the correct processing mode (reactive vs. deliberative).
- ResponseCompletenessEvaluator: Verifies that the response contains the expected keywords.
These evaluators return a score (0.0 or 1.0 for binary, 0.0-1.0 for continuous) and a comment.
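The two evaluators might look roughly like the plain functions below. This is a sketch, not the course's implementation: it assumes the agent wrapper returns a dict with `processing_mode` and `final_response` keys, that the installed LangSmith SDK accepts evaluator functions taking `outputs` and `reference_outputs`, and that test cases without an expected value should not be penalized.

```python
from typing import Any, Dict


def processing_mode_evaluator(
    outputs: Dict[str, Any], reference_outputs: Dict[str, Any]
) -> Dict[str, Any]:
    """Binary check: did the agent pick the expected processing mode?"""
    expected = reference_outputs.get("processing_mode")
    actual = outputs.get("processing_mode")
    if expected is None:
        # Assumption: cases with no expected mode (e.g. edge cases) are not penalized.
        return {"key": "processing_mode", "score": 1.0, "comment": "No expected mode given."}
    return {
        "key": "processing_mode",
        "score": 1.0 if actual == expected else 0.0,
        "comment": f"expected={expected}, actual={actual}",
    }


def response_completeness_evaluator(
    outputs: Dict[str, Any], reference_outputs: Dict[str, Any]
) -> Dict[str, Any]:
    """Continuous check: fraction of expected keywords found in the response."""
    keywords = reference_outputs.get("should_contain", [])
    text = outputs.get("final_response") or ""
    if not keywords:
        # Assumption: nothing to look for counts as complete.
        return {"key": "response_completeness", "score": 1.0, "comment": "No keywords given."}
    found = [word for word in keywords if word in text]
    return {
        "key": "response_completeness",
        "score": len(found) / len(keywords),
        "comment": f"found {len(found)} of {len(keywords)} keywords",
    }
```

Keyword matching is a coarse proxy for completeness; it is cheap and deterministic, which makes it a useful first gate before an LLM-based judge.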
Step 3: Evaluation Execution
LangSmith's evaluate() function automates the process:
- Environment Check: Ensure LANGSMITH_API_KEY is set.
- Run Agent: The agent is executed for each test case.
- Apply Evaluators: Custom and built-in evaluators score the agent's output.
- Result Display: Provides a link to the LangSmith UI for detailed results.
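The workflow above could be wired together roughly as follows. Treat this as a sketch under assumptions: the dataset name and experiment prefix are hypothetical, the test cases are assumed to use the `inputs` / `expected_outputs` shape from Step 1, and the LangSmith calls (`Client.has_dataset`, `Client.create_dataset`, `Client.create_examples`, `evaluate`) should be checked against the SDK version you have installed.

```python
import os
from typing import Any, Callable, Dict, List

DATASET_NAME = "wealth-advisor-test-set"  # hypothetical dataset name


def check_environment() -> List[str]:
    """Return the names of required environment variables that are missing."""
    required = ["LANGSMITH_API_KEY"]
    return [name for name in required if not os.getenv(name)]


def run_evaluation(
    target: Callable[[Dict[str, Any]], Dict[str, Any]],
    test_cases: List[Dict[str, Any]],
    evaluators: List[Callable[..., Dict[str, Any]]],
):
    """Upload the test set if needed, then run the agent and score its outputs."""
    missing = check_environment()
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    # Imported lazily so the environment check works without the SDK installed.
    from langsmith import Client
    from langsmith.evaluation import evaluate

    client = Client()
    if not client.has_dataset(dataset_name=DATASET_NAME):
        dataset = client.create_dataset(dataset_name=DATASET_NAME)
        client.create_examples(
            inputs=[case["inputs"] for case in test_cases],
            outputs=[case["expected_outputs"] for case in test_cases],
            dataset_id=dataset.id,
        )

    # evaluate() calls `target` once per example and applies each evaluator.
    return evaluate(
        target,
        data=DATASET_NAME,
        evaluators=evaluators,
        experiment_prefix="wealth-advisor-eval",  # hypothetical prefix
    )
```

Here `target` would be a thin wrapper that takes an example's `inputs` dict, runs the agent, and returns its output dict.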
LangSmith and Prompt Ops
Prompt Ops is an engineering approach to systematically manage, test, optimize, and monitor LLM prompts for quality and consistency.
LangSmith Support for Prompt Ops
- Prompt Version Management: Different prompt versions are tagged in the code (experiment_prefix, tags, run_name). LangSmith allows filtering and comparing these versions by success rate, completeness, latency, cost, etc., in the UI.
- Continuous Optimization: Prompts are iteratively improved by modifying code, running evaluation scripts, and analyzing assessment and production data in LangSmith.
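One way to keep prompt versions distinguishable is to derive the tagging values from a single version label. The sketch below is illustrative only: the prompt texts, the version labels, and the helper names `experiment_settings` and `run_config` are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict

# Hypothetical prompt versions under comparison.
PROMPT_VERSIONS = {
    "v1": "You are a wealth advisor. Answer the customer's question.",
    "v2": "You are a wealth advisor. Answer concisely and state the key risks.",
}


def experiment_settings(version: str) -> Dict[str, Any]:
    """Keyword arguments to pass to an evaluation run for one prompt version."""
    if version not in PROMPT_VERSIONS:
        raise KeyError(f"Unknown prompt version: {version}")
    return {
        "experiment_prefix": f"wealth-advisor-prompt-{version}",
        "metadata": {"prompt_version": version},
    }


def run_config(version: str) -> Dict[str, Any]:
    """Tags and run name for tracing a single run with one prompt version."""
    if version not in PROMPT_VERSIONS:
        raise KeyError(f"Unknown prompt version: {version}")
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return {
        "tags": ["wealth-advisor", f"prompt-{version}"],
        "metadata": {"prompt_version": version},
        "run_name": f"wealth-advisor-prompt-{version}-{stamp}",
    }
```

The dict from `run_config` can be passed as the `config` argument when invoking the agent, and `experiment_settings` can be unpacked into an evaluation call, so both traces and experiments can be filtered by the same version label.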
OpenEvals Usage
OpenEvals is an independent, open-source evaluator library developed by the LangChain team. It provides a set of pre-built evaluators that can be deeply integrated with LangSmith.
Relationship with LangSmith
- OpenEvals provides the implementations of various evaluators.
- LangSmith provides the platform and infrastructure for running and managing these evaluations. They work in conjunction but are not nested.
Built-in Evaluators
OpenEvals offers a wide range of evaluators including:
- CORRECTNESS_PROMPT: Verifies factual accuracy.
- CONCISENESS_PROMPT: Checks for brevity.
- ANSWER_RELEVANCE_PROMPT: Assesses how relevant the answer is to the query.
- RAG_HELPFULNESS_PROMPT: Evaluates the practical usefulness of RAG-generated answers.
- RAG_GROUNDEDNESS_PROMPT: Checks if the answer is supported by retrieved context.
- RAG_RETRIEVAL_RELEVANCE_PROMPT: Assesses the quality of retrieved documents.
- TOXICITY_PROMPT: Detects harmful content.
- HALLUCINATION_PROMPT: Identifies unsupported claims.
- CODE_CORRECTNESS_PROMPT: Evaluates code correctness.
- PLAN_ADHERENCE_PROMPT: Assesses how well an Agent follows its plan.
Example: Correctness Evaluation
from openevals.prompts import CORRECTNESS_PROMPT
from openevals.llm import create_llm_as_judge
from langchain_community.chat_models import ChatTongyi  # Example LLM
eval_llm = ChatTongyi(model_name="qwen-turbo", temperature=0)  # Use your LLM
evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    judge=eval_llm,
    continuous=True,  # Score on a 0.0-1.0 scale instead of pass/fail
    use_reasoning=False,
)
result = evaluator(
    # "What is machine learning?"
    inputs="什么是机器学习?",
    # "Machine learning is a branch of AI that lets computers learn from data through algorithms."
    outputs="机器学习是人工智能的一个分支,通过算法让计算机从数据中学习。",
    # "Machine learning is a technique that lets computers learn from data."
    reference_outputs="机器学习是让计算机从数据中学习的技术。"
)
print(f"Evaluation score: {result['score']}")  # Example Output: 0.8
CASE: Automated Testing with OpenEvals
This section integrates OpenEvals into the investment advisor AI assistant's automated testing.
Selected Evaluators
- OpenEvals Pre-defined Evaluators:
  - ANSWER_RELEVANCE_PROMPT
  - CONCISENESS_PROMPT
  - RAG_HELPFULNESS_PROMPT
  - HALLUCINATION_PROMPT (critical for financial advice)
  - TOXICITY_PROMPT (ensures compliance and safety)
- Custom Evaluators:
  - ProcessingModeEvaluator (same as in LangSmith testing)
  - ResponseCompletenessEvaluator (same as in LangSmith testing)
Output Preprocessing for OpenEvals
To allow LLM-as-a-judge evaluators to extract specific data (like processing_mode), the agent's output is formatted specially.
# From the investment advisor agent
# result = run_wealth_advisor(user_query=user_query, customer_id=customer_id)
# final_response = result.get("final_response", "")
# processing_mode = result.get("processing_mode", "unknown")
# Special format for evaluators to extract processing_mode
# ("处理模式" means "processing mode"; the evaluator prompts match on this literal prefix)
output_text = f"[处理模式: {processing_mode}]\n\n{final_response}"
# Return structured output for evaluations
# return {
# "output": output_text, # Full output including processing mode hint
# "final_response": final_response, # Original clean response
# "processing_mode": processing_mode, # Explicit mode
# }
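Because the prefix has a fixed shape, it can also be parsed deterministically, which is handy for checking the formatting step itself before handing the text to an LLM judge. The helper names `format_output` and `split_output` below are hypothetical; the format string is the one shown above.

```python
import re
from typing import Optional, Tuple

# Matches the "[处理模式: <mode>]" prefix followed by a blank line.
MODE_PREFIX = re.compile(r"^\[处理模式: (?P<mode>[^\]]+)\]\n\n")


def format_output(processing_mode: str, final_response: str) -> str:
    """Prepend the processing-mode hint to the clean response."""
    return f"[处理模式: {processing_mode}]\n\n{final_response}"


def split_output(output_text: str) -> Tuple[Optional[str], str]:
    """Recover (processing_mode, clean_response) from formatted output."""
    match = MODE_PREFIX.match(output_text)
    if match is None:
        # No hint present: return the text unchanged.
        return None, output_text
    return match.group("mode"), output_text[match.end():]
```

A round trip such as `split_output(format_output("reactive", "Markets closed higher."))` returns the original mode and response, so a mismatch points to the formatting step rather than the judge.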
Custom Evaluator Prompts
Specific prompts are designed for the ProcessingModeEvaluator and ResponseCompletenessEvaluator to instruct the LLM judge on how to extract information and score.
# PROCESSING_MODE_PROMPT (Simplified)
PROCESSING_MODE_PROMPT = """You are an evaluator assessing if the investment assistant chose the correct processing mode.
<Instructions>
Extract the actual processing mode from the output's beginning "[处理模式: xxx]".
Extract the expected processing mode from `reference_outputs` (e.g., {{"processing_mode": "reactive"}}).
Score 1.0 if they match, 0.0 otherwise.
</Instructions>
<User Query>{inputs}</User Query>
<Actual Output>{outputs}</Actual Output>
<Expected Processing Mode>{reference_outputs}</Expected Processing Mode>"""
# RESPONSE_COMPLETENESS_PROMPT (Simplified)
RESPONSE_COMPLETENESS_PROMPT = """You are an evaluator assessing if the assistant's response is complete and contains expected keywords.
<Instructions>
Ignore "[处理模式: xxx]" from the output.
Extract expected keywords from `reference_outputs` (e.g., {{"should_contain": ["keyword1", "keyword2"]}}).
Score based on the proportion of keywords found (0-1.0).
</Instructions>
<User Query>{inputs}</User Query>
<Assistant's Answer>{outputs}</Assistant's Answer>
<Expected Keywords>{reference_outputs}</Expected Keywords>"""
Evaluator Creation
# Create evaluation LLM
# eval_llm = ChatTongyi(model_name="qwen-turbo", dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"), temperature=0)
# Create custom evaluators using create_llm_as_judge
# processing_mode_evaluator = create_llm_as_judge(
# prompt=PROCESSING_MODE_PROMPT, feedback_key="processing_mode", judge=eval_llm, continuous=True, use_reasoning=False)
# response_completeness_evaluator = create_llm_as_judge(
# prompt=RESPONSE_COMPLETENESS_PROMPT, feedback_key="response_completeness", judge=eval_llm, continuous=True, use_reasoning=False)
Non-LangChain Family Agent Tools
DeepEval
DeepEval is an open-source LLM evaluation framework.
- Purpose: Analogous to Pytest/JUnit for traditional software, focusing on systematic quality testing and evaluation before deployment.
- Features: Provides over 40 built-in evaluation metrics (e.g., Hallucination, Faithfulness, Answer Relevancy, Toxicity, Bias, G-Eval).
- Comparison with LangSmith: DeepEval is for offline testing and scoring (CI/CD), while LangSmith is for online monitoring, debugging, and tracing (production). They can complement each other.
- Comparison with OpenEvals: Both are open-source evaluation libraries. DeepEval has more extensive metrics (RAGAS, Helm) and offers G-Eval syntax for custom evaluators. OpenEvals is more deeply integrated with the LangChain/LangSmith ecosystem.
Qwen Agent vs. LangChain Agent
Qwen Agent provides an alternative framework for building agents, particularly optimized for Chinese language processing.
Feature | LangChain/LangGraph | Qwen-Agent
Architecture | Explicit state management, graph-based | Implicit state management, declarative
Code Volume | Higher (defines nodes, edges, states) | Lower (about 47% less code)
Learning Curve | Steeper (2-3 weeks) | Flatter (3-5 days)
Controllability | High (precise control) | Medium (framework-controlled)
Visualization | LangSmith Web UI | Built-in Web UI
Tool Registration | Manual management | Decorator-based
Chinese Support | General | Optimized
Applicable Scenarios | Complex workflows, enterprise use, LangChain ecosystem | Rapid prototyping, Chinese-language applications, simple conversational agents
LangFuse Usage
LangFuse is an open-source LLM engineering platform focused on "observability + debugging + evaluation."
- Purpose: Provides tracing, prompt management, and lightweight evaluation for any LLM framework or model.
- Comparison with LangSmith:
  - LangFuse: Fully open-source, flexible integration, focuses on core observability and light evaluation.
  - LangSmith: Official LangChain commercial product, deep integration with LangChain, comprehensive enterprise-grade testing, evaluation, and monitoring.
Integration Steps for LangFuse
- Register: Create an account on LangFuse.
- Configure Environment Variables:
    LANGFUSE_SECRET_KEY="sk-XX"
    LANGFUSE_PUBLIC_KEY="pk-XX"
    LANGFUSE_BASE_URL="https://us.cloud.langfuse.com"
- Integrate: Add LangFuse client to your agent code (e.g., wrapping LLM calls or agent runs).
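For a LangChain or LangGraph agent, the integration step is commonly done by passing LangFuse's callback handler in the run config. The sketch below assumes that approach. The helper names are hypothetical, the import path of `CallbackHandler` differs between SDK major versions (both are tried), and older SDK releases read `LANGFUSE_HOST` rather than `LANGFUSE_BASE_URL`, so check the documentation for the version you install.

```python
import os
from typing import Any, Dict, List

REQUIRED_VARS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_BASE_URL")


def missing_langfuse_vars() -> List[str]:
    """Return the LangFuse environment variables that are not set."""
    return [name for name in REQUIRED_VARS if not os.getenv(name)]


def build_langfuse_config() -> Dict[str, Any]:
    """Build a run config that sends traces to LangFuse via a callback."""
    missing = missing_langfuse_vars()
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    # Imported lazily; the module path depends on the SDK major version.
    try:
        from langfuse.langchain import CallbackHandler  # newer SDK layout
    except ImportError:
        from langfuse.callback import CallbackHandler  # older SDK layout

    # The handler picks up credentials from the environment variables above.
    return {"callbacks": [CallbackHandler()]}


# Usage (assuming `agent` and `initial_state` exist in your code):
# result = agent.invoke(initial_state, config=build_langfuse_config())
```

Failing fast on missing variables avoids runs that silently produce no traces.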
Key Takeaways
- Hybrid Agent Architectures: Combining reactive and deliberative components can balance speed and depth, crucial for complex applications like financial advisors.
- LangSmith as an LLMOps Hub: It offers comprehensive tools for debugging, performance monitoring, automated testing, and prompt versioning, vital for maintaining and improving LLM applications.
- OpenEvals for Standardized Evaluation: This open-source library provides pre-built and customizable evaluators that integrate seamlessly with LangSmith, enabling systematic quality assessment of LLM outputs.
- Diverse Ecosystem of LLM Tools: Beyond LangChain, tools like DeepEval (offline testing), Qwen Agent (optimized for specific scenarios), and LangFuse (open-source observability) offer alternatives or complements depending on project needs.
- Importance of Automated Testing: Defining robust test sets and custom evaluators is essential for continuous integration and ensuring the reliability and quality of AI Agent responses.