Agent Evaluations

Agent evaluation is about understanding how well your agent actually performed — whether it understood the task, took the right steps, used tools when needed, and produced a useful result. Use it when your system behaves like an agent (reasoning, tool use, multi-step workflows).

Floeval supports three modes for agent evaluation:

| Mode | Description | Use when |
| --- | --- | --- |
| Pre-captured traces | You provide traces in the dataset | You already have conversation logs |
| Local agent | Floeval runs your agent in your Python environment and captures traces | Your agent runs locally (LangChain, custom) |
| FloTorch hosted | Floeval calls your agent on the FloTorch gateway | Your agent is deployed in the FloTorch Console |

The following metrics are available for agent evaluation:

| Metric | Provider | What it measures |
| --- | --- | --- |
| goal_achievement | builtin | Did the agent achieve the goal? (LLM-as-judge) |
| response_coherence | builtin | Is the final response consistent with the trace? |
| ragas:agent_goal_accuracy | RAGAS | Agent output vs. expected outcome |
| ragas:tool_call_accuracy | RAGAS | Were tool calls correct? (needs reference_tool_calls) |

Agent datasets use a "samples" array, similar to LLM/RAG datasets. The key difference is the trace field, which contains the full conversation log including tool calls. Always use "name" (not "tool") in reference_tool_calls.
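Because using "tool" instead of "name" is an easy mistake to make, it can help to check a dataset before a run. This is a hypothetical stdlib-only helper, not part of Floeval; it assumes the loaded {"samples": [...]} structure shown below:

```python
def check_tool_call_keys(dataset: dict) -> list[str]:
    """Flag reference_tool_calls entries that lack a "name" key.

    Hypothetical validator (not part of Floeval); expects the loaded
    {"samples": [...]} structure used by agent datasets.
    """
    problems = []
    for i, sample in enumerate(dataset.get("samples", [])):
        for call in sample.get("reference_tool_calls", []):
            if "name" not in call:
                problems.append(f'samples[{i}]: tool call missing "name": {call}')
    return problems
```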

Use a full dataset when you already have recorded agent conversations (from your app logs, LangChain callbacks, or manual export). Each sample includes the complete trace with messages, tool calls, and the final response:

```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "trace": {
        "messages": [
          {"role": "human", "content": "Get the weather for London"},
          {"role": "ai", "content": "", "tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]},
          {"role": "tool", "content": "Sunny, 22°C", "tool_name": "get_weather", "tool_call_id": "call_1"},
          {"role": "ai", "content": "The weather in London is sunny with 22°C.", "tool_calls": []}
        ],
        "final_response": "The weather in London is sunny with 22°C.",
        "metadata": {}
      },
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```
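The tool calls an agent actually made can be read straight out of a trace in this format, for example to diff them against reference_tool_calls. A stdlib-only sketch (hypothetical helper, assuming the message schema shown above):

```python
def tool_calls_in_trace(trace: dict) -> list[str]:
    """Collect the names of all tool calls made in a trace.

    Assumes the message schema shown above: "ai" messages carry a
    "tool_calls" list of {"name": ..., "args": ...} dicts.
    """
    names = []
    for msg in trace.get("messages", []):
        if msg.get("role") == "ai":
            names.extend(call["name"] for call in msg.get("tool_calls", []))
    return names
```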

Partial dataset — no trace (agent runs at evaluation time)

Use a partial dataset when you have test cases but no recorded traces. Floeval runs your agent for each sample and captures the trace automatically:

```json
{
  "samples": [
    {
      "user_input": "Get the weather for London",
      "reference_outcome": "Provides London weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]
    }
  ]
}
```

JSONL is also supported (one sample per line).
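If you already have a full dataset, you can derive a partial one from it by dropping the trace field, for example to re-run a live agent against previously logged test cases. A hypothetical stdlib-only sketch:

```python
def to_partial(full_dataset: dict) -> dict:
    """Strip trace fields so the agent is re-run at evaluation time.

    Hypothetical converter (not part of Floeval); keeps all other
    per-sample fields such as reference_outcome and reference_tool_calls.
    """
    return {
        "samples": [
            {k: v for k, v in sample.items() if k != "trace"}
            for sample in full_dataset["samples"]
        ]
    }
```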


Use this mode when you already have conversation logs from your app, LangChain callbacks, or manual export. No agent needs to run during evaluation — Floeval scores the traces directly.

Create a config file with your LLM credentials and the agent metrics you want to run. Use --agent to tell the CLI this is an agent evaluation:

agent_config.yaml

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
evaluation_config:
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config.yaml -d agent_full.json -o results.json
```

Use the AgentEvaluation class to run agent evaluations from code. Load your pre-captured traces from a JSON file and pass the metrics you want to score:

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = AgentDataset.from_file("agent_full.json")

evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence"],
    default_provider="builtin",
)

results = evaluation.run()
print(results.summary["aggregate_scores"])
```
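Since results.summary["aggregate_scores"] is indexed like a mapping of metric name to score, it is straightforward to gate a CI run on a threshold. A hypothetical helper (the exact shape of the results object is an assumption based on the snippet above):

```python
def failing_metrics(aggregate_scores: dict, threshold: float = 0.8) -> dict:
    """Return the metrics whose aggregate score falls below the threshold.

    Hypothetical CI gate; assumes aggregate_scores maps metric name -> float,
    as suggested by results.summary["aggregate_scores"] above.
    """
    return {m: s for m, s in aggregate_scores.items() if s < threshold}
```

For example, a CI job could fail the build when `failing_metrics(results.summary["aggregate_scores"])` is non-empty.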

Use this mode when your agent runs locally in your Python environment. Floeval wraps your agent with wrap_langchain_agent, runs it for each test case, captures the trace automatically, and then scores it. Requires langchain and langchain-openai.

Save your test cases as a JSONL file (one sample per line). Each sample includes the question, expected outcome, and optionally the expected tool calls:

{"user_input": "What's the weather in Paris?", "reference_outcome": "Provides Paris weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Paris"}}]}
{"user_input": "What's the weather in London?", "reference_outcome": "Provides London weather", "reference_tool_calls": [{"name": "get_weather", "args": {"city": "London"}}]}
Then define your agent, wrap it, and run the evaluation:

```python
import os
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from langchain_core.tools import tool

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import wrap_langchain_agent


@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    db = {"paris": "18°C", "tokyo": "22°C", "london": "14°C"}
    return db.get(city.lower(), f"{city}: No data")


llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
)
llm = ChatOpenAI(model=llm_config.chat_model, api_key=llm_config.api_key, base_url=llm_config.base_url)
agent = create_agent(model=llm, tools=[get_weather], system_prompt="Use tools when needed.")

dataset = AgentDataset.from_file(Path("agent_partial.jsonl"))
evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent=wrap_langchain_agent(agent),
)
results = evaluation.run()
print(results.summary["aggregate_scores"])
```

For agents that don’t use LangChain, decorate your agent function with @capture_trace and use log_turn and log_tool_result to record what happens during execution. Floeval uses these logs to build the trace:

```python
from floeval.utils.agent_trace import capture_trace, log_turn, log_tool_result
from floeval.config.schemas.io.agent_dataset import ToolCall


@capture_trace
def my_agent(user_input: str) -> str:
    search_result = f"Mock search for: {user_input}"
    log_tool_result("search", search_result)
    final = f"Answer based on: {search_result}"
    log_turn(output=final, tool_calls=[ToolCall(name="search", args={"query": user_input})])
    return final
```

Pass agent=my_agent to AgentEvaluation.


Use this mode when your agent is deployed in the FloTorch Console. Floeval sends each test case to the gateway, the hosted agent processes it, and Floeval captures the trace and scores the result. Requires pip install floeval[flotorch].

Prepare test cases with user_input and optionally reference_outcome and reference_tool_calls:

```json
{
  "samples": [
    {
      "user_input": "What is the weather in Tokyo?",
      "reference_outcome": "Provides Tokyo weather",
      "reference_tool_calls": [{"name": "get_weather", "args": {"city": "Tokyo"}}]
    }
  ]
}
```

Use create_flotorch_runner to connect to your deployed agent. Pass the runner as agent_runner to AgentEvaluation:

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import create_flotorch_runner

llm_config = OpenAIProviderConfig(
    base_url="https://your-gateway/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

runner = create_flotorch_runner("my-agent", llm_config=llm_config)

dataset = AgentDataset.from_file("agent_partial.json")
evaluation = AgentEvaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:tool_call_accuracy"],
    default_provider="builtin",
    agent_runner=runner,
)
results = evaluation.run()
print(results.summary["aggregate_scores"])
```

For command-line evaluation against FloTorch-hosted agents, add agent_name to evaluation_config to specify which deployed agent to call:

agent_config_flotorch.yaml

```yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-flotorch-api-key"
  chat_model: flotorch/turbo
evaluation_config:
  agent_name: "my-agent"
  default_provider: "builtin"
  metrics:
    - goal_achievement
    - response_coherence
```

```shell
floeval evaluate --agent -c agent_config_flotorch.yaml -d agent_partial.json -o results.json
```

Deploy agents in the FloTorch Agent Builder. Create API keys in Settings > API Keys.