Evaluations

Evaluations in the FloTorch console let you benchmark how models, retrieval, prompts, agents, and workflows perform against a dataset (ground-truth Q&A or task data). Work is organized as evaluation projects. Each project has an evaluation type and contains one or more experiments—each experiment is one configuration row (for example a different inferencing model or prompt variant).

In the workspace, open Evaluate to see the overview, your evaluation list, and Create evaluation. Typical flow:

  1. Choose an evaluation type (LLM, RAG, Prompt, Agent, or Workflow).
  2. Complete the wizard: Configuration → Metrics selection → Review, then run.
  3. Open the project’s results page to compare experiments, inspect per-question outputs, and export data.
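The wizard steps above boil down to assembling one configuration per project. As a minimal sketch only, the dictionary below illustrates the kind of information collected (the keys, values, and `review` helper are hypothetical, not FloTorch's actual schema or API):

```python
# Hypothetical evaluation-project configuration mirroring the wizard steps.
# All keys and values are illustrative, not FloTorch's real schema.
evaluation = {
    "name": "rag-baseline-vs-tuned",
    "type": "RAG",                       # one of: LLM, RAG, Prompt, Agent, Workflow
    "dataset": "ground_truth_qa.jsonl",  # ground-truth Q&A or task data
    "metrics": ["answer_relevancy", "faithfulness"],
    "experiments": [                     # each row = one configuration variant
        {"model": "model-a", "prompt": "v1"},
        {"model": "model-b", "prompt": "v1"},
    ],
}

def review(cfg):
    """Summarize a configuration before running, as the Review step would."""
    return f"{cfg['name']}: {cfg['type']} evaluation, {len(cfg['experiments'])} experiment(s)"

print(review(evaluation))
```

Each entry under "experiments" corresponds to one experiment row in the project's results page.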

Evaluation types and purposes:

  • LLM: Question/answer-style runs without retrieval context; compare inferencing models, N-shot settings, and prompts.
  • RAG: Retrieval-augmented runs using a knowledge base and dataset; metrics include answer quality and retrieval signals.
  • Prompt: Evaluates one or more system/user prompt pairs (optionally with RAG-style retrieval, depending on configuration).
  • Agent: Runs evaluations against a published agent, with trajectory-style metrics where applicable.
  • Workflow: Evaluates an agentic workflow (DAG) end-to-end against your dataset.

Available metrics depend on the type. The console registers metrics per category (for example Ragas and DeepEval answer relevancy for LLM-style runs; additional faithfulness and context metrics for RAG; built-in and Ragas metrics for agent/workflow). You pick metrics in the wizard from the set allowed for that evaluation.
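One way to picture this per-type restriction is a mapping from evaluation type to its allowed metrics, which the wizard then validates choices against. This is a sketch under stated assumptions: the metric names and the `validate_metrics` function are illustrative; consult the console for the metrics actually registered per category.

```python
# Hypothetical type -> allowed-metrics registry, illustrating how the wizard
# limits metric choices per evaluation category. Names are examples only.
ALLOWED_METRICS = {
    "LLM": {"ragas_answer_relevancy", "deepeval_answer_relevancy"},
    "RAG": {"ragas_answer_relevancy", "faithfulness", "context_precision"},
    "Agent": {"trajectory_accuracy", "ragas_answer_relevancy"},
}

def validate_metrics(eval_type, chosen):
    """Reject any metric not registered for the given evaluation type."""
    allowed = ALLOWED_METRICS.get(eval_type, set())
    invalid = set(chosen) - allowed
    if invalid:
        raise ValueError(f"Not allowed for {eval_type}: {sorted(invalid)}")
    return True

validate_metrics("RAG", ["faithfulness", "context_precision"])  # passes
```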

Key terms:
  • Evaluation project — Named container for a single “run” you configured (type, dataset, metrics, and overall settings). It appears in Evaluations with status such as running, completed, or failed.
  • Experiments — Rows inside that project. Each experiment corresponds to a distinct combination (for example a different inferencing model or prompt pair index). The console enforces a maximum number of experiments per project (currently 50).
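The per-project cap on experiments can be sketched as a simple guard. Only the limit of 50 comes from the source; the `add_experiment` function and project shape are hypothetical:

```python
# Hypothetical guard enforcing the per-project experiment cap (currently 50
# in the console). Function name and project structure are illustrative.
MAX_EXPERIMENTS = 50

def add_experiment(project, experiment):
    """Append an experiment row unless the project is already at the cap."""
    if len(project["experiments"]) >= MAX_EXPERIMENTS:
        raise ValueError(f"Project limited to {MAX_EXPERIMENTS} experiments")
    project["experiments"].append(experiment)
    return project

project = {"experiments": []}
add_experiment(project, {"model": "model-a"})
len(project["experiments"])  # 1
```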

Starting an evaluation run is limited to users who can administer evaluations in the workspace (workspace admins in the console API). Members with more restricted roles may view results where your organization allows it, but cannot start new runs.