Results and metrics

On the Evaluations page, each row shows:

  • Name — Project name.
  • Type — LLM, RAG, Prompt, Agent, or Workflow.
  • Status — Project lifecycle (for example running, completed, failed).
  • Experiments — How many experiment rows belong to the project.
  • Created — Creation time.
  • Results — Opens the project results page.

Opening a project shows the experiment results table for that evaluation. Typical columns include:

  • Id and Status per experiment.
  • Configuration columns that apply to your project type (for example Inferencing model, Evaluation model, Embedding model, System prompt, User prompt, N shot, KNN, KB name, Agent, or Workflow, depending on what was used).
  • Metrics — Selected Ragas/DeepEval/builtin scores as configured.
  • Duration — How long the run took when available.
  • Cost — Estimated cost when the workspace exposes pricing for the models used.

Use the Columns control to show or hide optional fields so the table stays readable.
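When pricing is available, the estimated cost is typically derived from token counts and per-token prices. A minimal sketch of that arithmetic, where the function name, token counts, and per-1K-token prices are illustrative assumptions rather than values from the product:

```python
# Illustrative sketch of LLM cost estimation from token usage and pricing.
# The prices and token counts below are made-up assumptions; real
# workspaces expose their own pricing for each model.

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Return the estimated cost for one request, in currency units."""
    return ((prompt_tokens / 1000) * price_in_per_1k
            + (completion_tokens / 1000) * price_out_per_1k)

# Example: 1,200 prompt tokens and 300 completion tokens at
# hypothetical rates of 0.50 and 1.50 per 1K tokens.
cost = estimate_cost(1200, 300, 0.50, 1.50)
print(round(cost, 4))  # 0.6 + 0.45 = 1.05
```

The same per-request figure, summed across rows, is what a per-experiment Cost column would surface.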

The UI reflects whether work is still running, completed, or failed, including cases where some experiments finish and others fail. Use this to know when scores are final.


Select an experiment to open its detail view. There you can:

  • See per-question (or per-row) results: question, ground truth, generated answer, and metric columns.
  • Review configuration used for that experiment (IDs, prompts, models, retrieval settings).

Use Download results (or equivalent export) to pull data for spreadsheets or external review. Exports include numeric fields such as cost with full precision where applicable.
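Once exported, the results file can be processed like any CSV. A minimal sketch using only the standard library; the column names and values here are hypothetical examples, not the product's actual export schema:

```python
import csv
import io

# Hypothetical stand-in for a downloaded export file. A real export
# would be opened with open("results.csv") instead of io.StringIO.
export = io.StringIO(
    "id,status,faithfulness,answer_relevancy,cost\n"
    "exp-1,completed,0.91,0.88,0.012345\n"
    "exp-2,completed,0.84,0.90,0.009876\n"
)

rows = list(csv.DictReader(export))

# Cost is exported with full precision, so parse it as a float
# rather than relying on the rounded value shown in the UI.
total_cost = sum(float(r["cost"]) for r in rows)
best = max(rows, key=lambda r: float(r["faithfulness"]))

print(best["id"], round(total_cost, 6))
```

This kind of post-processing is also where spreadsheet tools pick up: the metric columns arrive as plain numeric fields.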


For RAG evaluation projects, once an experiment completes successfully and its configuration is deployable, you can use Deploy as RAG endpoint from the results view to create a RAG endpoint from that experiment's setup (subject to workspace permissions and gateway behavior).


A couple of practical notes:

  • If context columns are empty for an LLM-only run, that usually means there was no retrieved context for that row (expected for pure LLM evaluation).
  • Compare experiments on the same project to isolate the effect of changing models, prompts, or retrieval parameters.