Agent Evaluations

Agent evaluation is about understanding how well your agent actually performed — whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).

Floeval supports three modes for agent evaluation:

| Mode | Description | Use when |
| --- | --- | --- |
| Pre-captured traces | You provide traces in the dataset | You already have conversation logs |
| Local agent | Floeval runs your agent in your Python environment and captures traces | Your agent runs locally (LangChain, custom) |
| FloTorch hosted | Floeval calls your agent on the FloTorch gateway | Your agent is deployed in the FloTorch Console |

The following metrics are available for agent evaluation:

| Metric | Provider | What it measures |
| --- | --- | --- |
| goal_achievement | builtin | Did the agent achieve the goal? (LLM-as-judge) |
| response_coherence | builtin | Is the final response consistent with the trace? |
| ragas:agent_goal_accuracy | RAGAS | Agent output vs. expected outcome |
| ragas:tool_call_accuracy | RAGAS | Were tool calls correct? (needs reference_tool_calls) |

Agent datasets use a "samples" array, similar to LLM/RAG datasets. The key difference is the trace field, which contains the full conversation log including tool calls. Always use "name" (not "tool") in reference_tool_calls.
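Because using "tool" instead of "name" is an easy mistake to make, it can help to check a dataset before a run. This is a hypothetical stdlib-only helper, not part of Floeval; it assumes the loaded {"samples": [...]} structure shown below:

```python
def check_tool_call_keys(dataset: dict) -> list[str]:
    """Flag reference_tool_calls entries that lack a "name" key.

    Hypothetical validator (not part of Floeval); expects the loaded
    {"samples": [...]} structure used by agent datasets.
    """
    problems = []
    for i, sample in enumerate(dataset.get("samples", [])):
        for call in sample.get("reference_tool_calls", []):
            if "name" not in call:
                problems.append(f'samples[{i}]: tool call missing "name": {call}')
    return problems
```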

Use a full dataset when you already have recorded agent conversations (from your app logs, LangChain callbacks, or manual export). Each sample includes the complete trace with messages, tool calls, and the final response:

```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "trace": {
        "messages": [
          {"role": "human", "content": "Get the weather for London"},
          {"role": "ai", "content": "", "tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]},
          {"role": "tool", "content": "Sunny, 22°C", "tool_name": "get_weather", "tool_call_id": "call_1"},
          {"role": "ai", "content": "The weather in London is sunny with 22°C.", "tool_calls": []}
        ],
        "final_response": "The weather in London is sunny with 22°C.",
        "metadata": {}
      },
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```
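The tool calls an agent actually made can be read straight out of a trace in this format, for example to diff them against reference_tool_calls. A stdlib-only sketch (hypothetical helper, assuming the message schema shown above):

```python
def tool_calls_in_trace(trace: dict) -> list[str]:
    """Collect the names of all tool calls made in a trace.

    Assumes the message schema shown above: "ai" messages carry a
    "tool_calls" list of {"name": ..., "args": ...} dicts.
    """
    names = []
    for msg in trace.get("messages", []):
        if msg.get("role") == "ai":
            names.extend(call["name"] for call in msg.get("tool_calls", []))
    return names
```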

Partial dataset — no trace (agent runs at evaluation time)

Use a partial dataset when you have test cases but no recorded traces. Floeval runs your agent for each sample and captures the trace automatically:

```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```

JSONL is also supported (one sample per line).
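If you already have a full dataset, you can derive a partial one from it by dropping the trace field, for example to re-run a live agent against previously logged test cases. A hypothetical stdlib-only sketch:

```python
def to_partial(full_dataset: dict) -> dict:
    """Strip trace fields so the agent is re-run at evaluation time.

    Hypothetical converter (not part of Floeval); keeps all other
    per-sample fields such as reference_outcome and reference_tool_calls.
    """
    return {
        "samples": [
            {k: v for k, v in sample.items() if k != "trace"}
            for sample in full_dataset["samples"]
        ]
    }
```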


Use this mode when you already have conversation logs from your app, LangChain callbacks, or manual export. No agent needs to run during evaluation — Floeval scores the traces directly.

Create a config file with your LLM credentials and the agent metrics you want to run. Use --agent to tell the CLI this is an agent evaluation:

agent_config.yaml

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
evaluation_config:
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config.yaml -d agent_full.json -o results.json
```

Use the AgentEvaluation class to run agent evaluations from code. Load your pre-captured traces from a JSON file and pass the metrics you want to score:

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = AgentDataset.from_file("agent_full.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence"],
    default_provider="builtin",
)

results = evaluation.run()
print(results.summary["aggregate_scores"])
```
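Since results.summary["aggregate_scores"] is indexed like a mapping of metric name to score, it is straightforward to gate a CI run on a threshold. A hypothetical helper (the exact shape of the results object is an assumption based on the snippet above):

```python
def failing_metrics(aggregate_scores: dict, threshold: float = 0.8) -> dict:
    """Return the metrics whose aggregate score falls below the threshold.

    Hypothetical CI gate; assumes aggregate_scores maps metric name -> float,
    as suggested by results.summary["aggregate_scores"] above.
    """
    return {m: s for m, s in aggregate_scores.items() if s < threshold}
```

For example, a CI job could fail the build when `failing_metrics(results.summary["aggregate_scores"])` is non-empty.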

Use this mode when your agent runs locally in your Python environment. Floeval wraps your agent with wrap_langchain_agent, runs it for each test case, captures the trace automatically, and then scores it. Requires langchain and langchain-openai.

Save your test cases as a JSONL file (one sample per line). Each sample includes the question, expected outcome, and optionally the expected tool calls:

{"user_input": "What's the weather in Paris?", "reference_outcome": "Provides Paris weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]}
{"user_input": "What's the weather in London?", "reference_outcome": "Provides London weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]}
Then define your agent, wrap it, and run the evaluation:

```python
import os
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from langchain_core.tools import tool

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import wrap_langchain_agent


@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    db = {"paris": "18°C", "tokyo": "22°C", "london": "14°C"}
    return db.get(city.lower(), f"{city}: No data")


llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
)
llm = ChatOpenAI(model=llm_config.chat_model, api_key=llm_config.api_key, base_url=llm_config.base_url)
agent = create_agent(model=llm, tools=[get_weather], system_prompt="Use tools when needed.")

dataset = AgentDataset.from_file(Path("agent_partial.jsonl"))
evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent=wrap_langchain_agent(agent),
)
results = evaluation.run()
print(results.summary["aggregate_scores"])
```

For agents that don’t use LangChain, decorate your agent function with @capture_trace and use log_turn and log_tool_result to record what happens during execution. Floeval uses these logs to build the trace:

```python
from floeval.utils.agent_trace import capture_trace, log_turn, log_tool_result
from floeval.config.schemas.io.agent_dataset import ToolCall


@capture_trace
def my_agent(user_input: str) -> str:
    search_result = f"Mock search for: {user_input}"
    log_tool_result("search", search_result)
    final = f"Answer based on: {search_result}"
    log_turn(output=final, tool_calls=[ToolCall(name="search", args={"query": user_input})])
    return final
```

Pass agent=my_agent to AgentEvaluation.


Use this mode when your agent is deployed in the FloTorch Console. Floeval sends each test case to the gateway, the hosted agent processes it, and Floeval captures the trace and scores the result. Requires pip install floeval[flotorch].

Prepare test cases with user_input and optionally reference_outcome and reference_tool_calls:

```json
{
  "samples": [
    {
      "user_input": "What is the weather in Tokyo?",
      "reference_outcome": "Provides Tokyo weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Tokyo"}}]
    }
  ]
}
```

Use create_flotorch_runner to connect to your deployed agent. Pass the runner as agent_runner to AgentEvaluation:

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import create_flotorch_runner

llm_config = OpenAIProviderConfig(
    base_url="https://your-gateway/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

runner = create_flotorch_runner("my-agent", llm_config=llm_config)

dataset = AgentDataset.from_file("agent_partial.json")
evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent_runner=runner,
)
results = evaluation.run()
print(results.summary["aggregate_scores"])
```

For command-line evaluation against FloTorch-hosted agents, add agent_name to evaluation_config to specify which deployed agent to call:

agent_config_flotorch.yaml

```yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-flotorch-api-key"
  chat_model: flotorch/turbo
evaluation_config:
  agent_name: "my-agent"
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config_flotorch.yaml -d agent_partial.json -o results.json
```

Deploy agents in the FloTorch Agent Builder. Create API keys in Settings > API Keys.