Evaluations

Evaluations in the FloTorch console let you benchmark how models, retrieval, prompts, agents, and workflows perform against a dataset (ground-truth Q&A or task data). Work is organized as evaluation projects. Each project has an evaluation type and contains one or more experiments—each experiment is one configuration row (for example a different inferencing model or prompt variant).

In the workspace, open Evaluate to see the overview, your evaluation list, and Create evaluation. Typical flow:

  1. Choose an evaluation type (LLM, RAG, Prompt, Agent, or Workflow).
  2. Complete the wizard: Configuration → Metrics selection → Review, then run.
  3. Open the project’s results page to compare experiments, inspect per-question outputs, and export data.
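The wizard steps above boil down to assembling one configuration per project. As a minimal sketch only, the dictionary below illustrates the kind of information collected (the keys, values, and `review` helper are hypothetical, not FloTorch's actual schema or API):

```python
# Hypothetical evaluation-project configuration mirroring the wizard steps.
# All keys and values are illustrative, not FloTorch's real schema.
evaluation = {
    "name": "rag-baseline-vs-tuned",
    "type": "RAG",                       # one of: LLM, RAG, Prompt, Agent, Workflow
    "dataset": "ground_truth_qa.jsonl",  # ground-truth Q&A or task data
    "metrics": ["answer_relevancy", "faithfulness"],
    "experiments": [                     # each row = one configuration variant
        {"model": "model-a", "prompt": "v1"},
        {"model": "model-b", "prompt": "v1"},
    ],
}

def review(cfg):
    """Summarize a configuration before running, as the Review step would."""
    return f"{cfg['name']}: {cfg['type']} evaluation, {len(cfg['experiments'])} experiment(s)"

print(review(evaluation))
```

Each entry under "experiments" corresponds to one experiment row in the project's results page.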

Evaluation types and purposes:

  • LLM: Question/answer-style runs without retrieval context; compare inferencing models, N-shot settings, and prompts.
  • RAG: Retrieval-augmented runs using a knowledge base and dataset; metrics include answer quality and retrieval signals.
  • Prompt: Evaluates one or more system/user prompt pairs (optionally with RAG-style retrieval, depending on configuration).
  • Agent: Runs evaluations against a published agent, with trajectory-style metrics where applicable.
  • Workflow: Evaluates an agentic workflow (DAG) end-to-end against your dataset.

Available metrics depend on the type. The console registers metrics per category (for example Ragas and DeepEval answer relevancy for LLM-style runs; additional faithfulness and context metrics for RAG; built-in and Ragas metrics for agent/workflow). You pick metrics in the wizard from the set allowed for that evaluation.
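One way to picture this per-type restriction is a mapping from evaluation type to its allowed metrics, which the wizard then validates choices against. This is a sketch under stated assumptions: the metric names and the `validate_metrics` function are illustrative; consult the console for the metrics actually registered per category.

```python
# Hypothetical type -> allowed-metrics registry, illustrating how the wizard
# limits metric choices per evaluation category. Names are examples only.
ALLOWED_METRICS = {
    "LLM": {"ragas_answer_relevancy", "deepeval_answer_relevancy"},
    "RAG": {"ragas_answer_relevancy", "faithfulness", "context_precision"},
    "Agent": {"trajectory_accuracy", "ragas_answer_relevancy"},
}

def validate_metrics(eval_type, chosen):
    """Reject any metric not registered for the given evaluation type."""
    allowed = ALLOWED_METRICS.get(eval_type, set())
    invalid = set(chosen) - allowed
    if invalid:
        raise ValueError(f"Not allowed for {eval_type}: {sorted(invalid)}")
    return True

validate_metrics("RAG", ["faithfulness", "context_precision"])  # passes
```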

Key terms:
  • Evaluation project — Named container for a single “run” you configured (type, dataset, metrics, and overall settings). It appears in Evaluations with status such as running, completed, or failed.
  • Experiments — Rows inside that project. Each experiment corresponds to a distinct combination (for example a different inferencing model or prompt pair index). The console enforces a maximum number of experiments per project (currently 50).
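The per-project cap on experiments can be sketched as a simple guard. Only the limit of 50 comes from the source; the `add_experiment` function and project shape are hypothetical:

```python
# Hypothetical guard enforcing the per-project experiment cap (currently 50
# in the console). Function name and project structure are illustrative.
MAX_EXPERIMENTS = 50

def add_experiment(project, experiment):
    """Append an experiment row unless the project is already at the cap."""
    if len(project["experiments"]) >= MAX_EXPERIMENTS:
        raise ValueError(f"Project limited to {MAX_EXPERIMENTS} experiments")
    project["experiments"].append(experiment)
    return project

project = {"experiments": []}
add_experiment(project, {"model": "model-a"})
len(project["experiments"])  # 1
```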

Starting an evaluation run is limited to users who can administer evaluations in the workspace (workspace admins in the console API). Members with more restricted roles may view results where your organization allows it, but cannot start new runs.