# Workflow Evaluations
Workflow evaluations validate multi-agent pipelines where multiple agents work together in a DAG (directed acyclic graph). Floeval measures whether all agents responded correctly, evaluates each agent’s tool calls and outputs, and checks whether the overall workflow output is correct. Use this when you have multi-step workflows deployed on the FloTorch gateway.
Prerequisites: Create your agents in the FloTorch Console, link them in your workflow DAG, and create an API key for authentication. You can create any number of agents in the FloTorch Console and orchestrate them as sequential, parallel, or combined workflows. Requires `pip install floeval[flotorch]`.
## How It Works
- You define a DAG config with nodes (`START`, `AGENT`, `END`) and edges between them
- You create an agent dataset with test cases and expected outcomes
- `WorkflowRunner` executes the workflow by calling each agent node according to the DAG
- Floeval scores each agent’s behavior and the overall workflow output
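The execution model can be pictured as a topological walk over the DAG. The sketch below is illustrative only (it is not the `WorkflowRunner` internals); the `dag` dict mirrors the node/edge format shown in Step 1, and `execution_order` is a hypothetical helper:

```python
from collections import deque

def execution_order(dag: dict) -> list[str]:
    """Return node ids in a valid topological order (Kahn's algorithm).

    Illustrative sketch of how a DAG runner could schedule nodes;
    not part of the floeval API.
    """
    indegree = {node["id"]: 0 for node in dag["nodes"]}
    children = {node["id"]: [] for node in dag["nodes"]}
    for edge in dag["edges"]:
        children[edge["sourceNodeId"]].append(edge["targetNodeId"])
        indegree[edge["targetNodeId"]] += 1

    # Start from nodes with no incoming edges (the START node).
    queue = deque(nid for nid, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        nid = queue.popleft()
        order.append(nid)
        for child in children[nid]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if len(order) != len(indegree):
        raise ValueError("DAG config contains a cycle")
    return order
```

For a sequential DAG this yields `start → agent1 → agent2 → end`; for a parallel DAG, the branches can run in any relative order between `start` and `end`.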
## Step 1: Define the DAG Config
The DAG config specifies the workflow structure as a graph of nodes and edges. Each `AGENT` node references a deployed agent by name (e.g. `agent1:latest`, `agent2:latest`). You can create sequential flows (agent1 → agent2 → agent3), parallel branches, or a combination of both.
### Example: Sequential workflow
One agent runs after another. The output of each agent flows to the next:
```json
{
  "uid": "sequential-workflow-001",
  "name": "Sequential Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "agent1", "targetNodeId": "agent2"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"}
  ]
}
```
### Example: Parallel workflow

Multiple agents run in parallel from the same starting point. Their outputs can converge to a single node:
```json
{
  "uid": "parallel-workflow-001",
  "name": "Parallel Workflow",
  "nodes": [
    {"id": "start", "type": "START", "label": "Start"},
    {"id": "agent1", "type": "AGENT", "label": "Agent 1", "agentName": "agent1:latest"},
    {"id": "agent2", "type": "AGENT", "label": "Agent 2", "agentName": "agent2:latest"},
    {"id": "agent3", "type": "AGENT", "label": "Agent 3", "agentName": "agent3:latest"},
    {"id": "end", "type": "END", "label": "End"}
  ],
  "edges": [
    {"sourceNodeId": "start", "targetNodeId": "agent1"},
    {"sourceNodeId": "start", "targetNodeId": "agent2"},
    {"sourceNodeId": "start", "targetNodeId": "agent3"},
    {"sourceNodeId": "agent1", "targetNodeId": "end"},
    {"sourceNodeId": "agent2", "targetNodeId": "end"},
    {"sourceNodeId": "agent3", "targetNodeId": "end"}
  ]
}
```
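A DAG config in this format can be sanity-checked before deploying it. The sketch below is a hypothetical helper, not part of floeval; it only checks the structural invariants visible in the examples (one `START`, one `END`, `agentName` on every `AGENT` node, edges that reference declared nodes):

```python
def validate_dag(dag: dict) -> None:
    """Basic structural checks for a workflow DAG config (illustrative helper)."""
    node_ids = {node["id"] for node in dag["nodes"]}
    types = [node["type"] for node in dag["nodes"]]

    if types.count("START") != 1 or types.count("END") != 1:
        raise ValueError("expected exactly one START and one END node")

    for node in dag["nodes"]:
        if node["type"] == "AGENT" and "agentName" not in node:
            raise ValueError(f"AGENT node {node['id']!r} is missing agentName")

    for edge in dag["edges"]:
        if edge["sourceNodeId"] not in node_ids or edge["targetNodeId"] not in node_ids:
            raise ValueError(f"edge references an undeclared node: {edge}")
```

Running this on a config before handing it to the runner catches typos in node ids early, rather than mid-evaluation.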
## Step 2: Prepare Your Dataset

Each sample represents a test case for the full workflow. Include `user_input` (what the user sends to the workflow) and `reference_outcome` (what you expect the workflow to produce). The workflow runner sends the input through all agent nodes in the DAG according to the edges you defined:
```json
{
  "samples": [
    {
      "user_input": "My order has not arrived after two weeks.",
      "reference_outcome": "An apology and a case escalation to the shipping team."
    },
    {
      "user_input": "What is the status of order #12345?",
      "reference_outcome": "The order is shipped and arriving tomorrow."
    }
  ]
}
```
## Step 3: Create Your Config

The config includes `llm_config` for LLM credentials, `evaluation_config` for metrics, and `agent_workflow_config` for the DAG definition. The `agent_workflow_config.config` section contains the same DAG structure from Step 1:
```yaml
llm_config:
  base_url: "https://gateway.flotorch.cloud/openai/v1"
  api_key: "your-gateway-key"
  chat_model: gpt-4o-mini

evaluation_config:
  metrics:
    - goal_achievement
    - ragas:agent_goal_accuracy

agent_workflow_config:
  dataset_url: "https://your-storage/agent_dataset.json"
  config:
    uid: "workflow-001"
    name: "Sequential Workflow"
    nodes:
      - {id: "start", type: "START", label: "Start"}
      - {id: "agent1", type: "AGENT", label: "Agent 1", agentName: "agent1:latest"}
      - {id: "agent2", type: "AGENT", label: "Agent 2", agentName: "agent2:latest"}
      - {id: "end", type: "END", label: "End"}
    edges:
      - {sourceNodeId: "start", targetNodeId: "agent1"}
      - {sourceNodeId: "agent1", targetNodeId: "agent2"}
      - {sourceNodeId: "agent2", targetNodeId: "end"}
```
## Step 4: Run

### From the command line
Use `--agent` with the workflow config. Floeval reads the DAG definition, runs the workflow for each sample, and scores the results:
```sh
floeval evaluate --agent -c workflow_config.yaml -d agent_dataset.json -o workflow_results.json
```
### From code

Create a `WorkflowRunner` from the DAG config and pass it as `agent_runner` to `AgentEvaluation`. The runner executes the full DAG for each sample. Floeval evaluates each agent’s tool calls and outputs, and scores the overall workflow output:
```python
import json

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.flotorch.cloud/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

with open("workflow_config.json") as f:
    dag_config = json.load(f)

runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(samples=[
    PartialAgentSample(
        user_input="My order has not arrived after two weeks.",
        reference_outcome="An apology and a case escalation to the shipping team.",
    ),
])

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "response_coherence", "ragas:agent_goal_accuracy"],
    default_provider="builtin",
)

results = evaluation.run()
print("Summary:", results.summary)

for row in results.sample_results:
    print("Final response:", row.get("final_response"))
    print("Agent traces:", len(row.get("agent_traces", [])), "nodes")
```
## Available Metrics

Floeval evaluates at both the workflow level and the individual agent level:
| Metric | What it measures |
|---|---|
| `goal_achievement` | Did the workflow achieve the intended goal? |
| `response_coherence` | Is the final response consistent with the workflow trace? |
| `ragas:agent_goal_accuracy` | Workflow output vs. expected outcome |
| `ragas:tool_call_accuracy` | Were each agent’s tool calls correct? |
Results include per-agent traces so you can see which agents responded and how each contributed to the overall output.
## Next Steps
- Agent Evaluations — evaluate single agents
- LLM Evaluations — evaluate raw LLM output
- RAG Evaluations — evaluate retrieval-augmented generation