Results and metrics

On the Evaluations page, each row shows:

  • Name — Project name.
  • Type — LLM, RAG, Prompt, Agent, or Workflow.
  • Status — Project lifecycle (for example running, completed, failed).
  • Experiments — How many experiment rows belong to the project.
  • Created — Creation time.
  • Results — Opens the project results page.

Opening a project shows the experiment results table for that evaluation. Typical columns include:

  • Id and Status per experiment.
  • Configuration columns that apply to your project type (for example Inferencing model, Evaluation model, Embedding model, System prompt, User prompt, N shot, KNN, KB name, Agent, or Workflow, depending on what was used).
  • Metrics — Selected Ragas/DeepEval/builtin scores as configured.
  • Duration — How long the run took when available.
  • Cost — Estimated cost when the workspace exposes pricing for the models used.

Use the Columns control to show or hide optional fields so the table stays readable.
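When pricing is available, the estimated cost is typically derived from token counts and per-token prices. A minimal sketch of that arithmetic, where the function name, token counts, and per-1K-token prices are illustrative assumptions rather than values from the product:

```python
# Illustrative sketch of LLM cost estimation from token usage and pricing.
# The prices and token counts below are made-up assumptions; real
# workspaces expose their own pricing for each model.

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Return the estimated cost for one request, in currency units."""
    return ((prompt_tokens / 1000) * price_in_per_1k
            + (completion_tokens / 1000) * price_out_per_1k)

# Example: 1,200 prompt tokens and 300 completion tokens at
# hypothetical rates of 0.50 and 1.50 per 1K tokens.
cost = estimate_cost(1200, 300, 0.50, 1.50)
print(round(cost, 4))  # 0.6 + 0.45 = 1.05
```

The same per-request figure, summed across rows, is what a per-experiment Cost column would surface.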

The UI reflects whether work is still running, completed, or failed, including cases where some experiments finish and others fail. Use this to know when scores are final.


Select an experiment to open its detail view. There you can:

  • See per-question (or per-row) results: question, ground truth, generated answer, and metric columns.
  • Review configuration used for that experiment (IDs, prompts, models, retrieval settings).

Use Download results (or equivalent export) to pull data for spreadsheets or external review. Exports include numeric fields such as cost with full precision where applicable.
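Once exported, the results file can be processed like any CSV. A minimal sketch using only the standard library; the column names and values here are hypothetical examples, not the product's actual export schema:

```python
import csv
import io

# Hypothetical stand-in for a downloaded export file. A real export
# would be opened with open("results.csv") instead of io.StringIO.
export = io.StringIO(
    "id,status,faithfulness,answer_relevancy,cost\n"
    "exp-1,completed,0.91,0.88,0.012345\n"
    "exp-2,completed,0.84,0.90,0.009876\n"
)

rows = list(csv.DictReader(export))

# Cost is exported with full precision, so parse it as a float
# rather than relying on the rounded value shown in the UI.
total_cost = sum(float(r["cost"]) for r in rows)
best = max(rows, key=lambda r: float(r["faithfulness"]))

print(best["id"], round(total_cost, 6))
```

This kind of post-processing is also where spreadsheet tools pick up: the metric columns arrive as plain numeric fields.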


For RAG evaluation projects, once an experiment completes successfully and its configuration is deployable, you can use Deploy as RAG endpoint from the results view to create a RAG endpoint from that experiment's setup (subject to workspace permissions and gateway behavior).


A couple of practical notes:

  • If context columns are empty for an LLM-only run, that usually means there was no retrieved context for that row (expected for pure LLM evaluation).
  • Compare experiments on the same project to isolate the effect of changing models, prompts, or retrieval parameters.