
Prompt Evaluations

Prompt evaluations let you try different instructions (system prompts) on the same questions to see which one makes your LLM answer better. For example: “Answer in one sentence” vs “Answer with a short explanation.” Floeval runs each question through your LLM once with each instruction, gets a response for each, scores them, and shows you which instruction performed better. You can test as many prompts as you want on as many questions as you want.


  1. You create a prompts file (YAML) — each entry is a different instruction (e.g. prompt1 = “Be concise”, prompt2 = “Be detailed”)
  2. Your dataset has questions and a list of which prompts to try for each question
  3. Floeval runs each question with each prompt — so 2 questions × 2 prompts = 4 LLM calls and 4 responses
  4. Floeval scores each response so you can compare which prompt works best
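The fan-out in step 3 is a plain cross product of samples and their prompt lists. A minimal sketch of that expansion (the names here are illustrative, not Floeval internals):

```python
# Hypothetical sketch of the fan-out Floeval performs at runtime:
# one LLM call per (sample, prompt_id) pair.
samples = [
    {"user_input": "What is the capital of France?", "prompt_ids": ["prompt1", "prompt2"]},
    {"user_input": "What is RAG in machine learning?", "prompt_ids": ["prompt1", "prompt2"]},
]

calls = [
    (sample["user_input"], prompt_id)
    for sample in samples
    for prompt_id in sample["prompt_ids"]
]

print(len(calls))  # 2 questions x 2 prompts = 4 LLM calls
```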

Prompt evaluations are always partial — Floeval generates responses at runtime. There is no pre-generated llm_response in the dataset.

| Step | What to do |
| --- | --- |
| 1. Prompts file | Create a YAML file with named prompts. Each prompt needs a `template` field. |
| 2. Dataset | Each sample needs `user_input` and `prompt_ids` (array of prompt IDs to test). Do not include `llm_response`. |
| 3. Config | Add `prompts_file` under `evaluation_config` and a `dataset_generation_config` with `generator_model`. |
| 4. From code | Pass `partial_dataset=True`, `dataset_generator_model`, and `prompts_file` to `Evaluation()`. |

Define each prompt variant in a YAML file. Every prompt needs a template field containing the system instruction that Floeval will use when generating responses:

prompts.yaml

```yaml
prompts:
  prompt1:
    template: "Answer the question directly and concisely."
  prompt2:
    template: "Answer the question with a brief explanation of your reasoning."
```

Each sample has a question (user_input) and a list of prompts to test (prompt_ids). Floeval will run the question through your LLM once for each prompt in the list. For example, if a question has ["prompt1", "prompt2"], Floeval gets 2 responses (one per prompt) and scores both:

```json
{
  "samples": [
    {
      "user_input": "What is the capital of France?",
      "prompt_ids": ["prompt1", "prompt2"]
    },
    {
      "user_input": "What is RAG in machine learning?",
      "prompt_ids": ["prompt1", "prompt2"]
    }
  ]
}
```

The prompt_ids values (prompt1, prompt2) must match the keys in your prompts file. This produces 4 evaluations (2 questions × 2 prompts).
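A quick way to catch mismatches before running is to check every `prompt_ids` entry against the keys in your prompts file. A standalone sketch (not part of Floeval; the prompts mapping is inlined as a plain dict to keep it dependency-free):

```python
import json

# The keys under `prompts:` in prompts.yaml, loaded here as a plain dict.
prompt_templates = {
    "prompt1": "Answer the question directly and concisely.",
    "prompt2": "Answer the question with a brief explanation of your reasoning.",
}

dataset = json.loads("""
{
  "samples": [
    {"user_input": "What is the capital of France?", "prompt_ids": ["prompt1", "prompt2"]},
    {"user_input": "What is RAG in machine learning?", "prompt_ids": ["prompt1", "prompt2"]}
  ]
}
""")

# Every prompt ID referenced by the dataset must exist in the prompts file.
unknown = {
    pid
    for sample in dataset["samples"]
    for pid in sample["prompt_ids"]
    if pid not in prompt_templates
}
assert not unknown, f"prompt_ids not defined in prompts file: {unknown}"

total = sum(len(s["prompt_ids"]) for s in dataset["samples"])
print(total)  # 4 evaluations
```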


The config links everything together: your LLM credentials, the metrics to run, and the path to your prompts file. Add prompts_file under evaluation_config and include dataset_generation_config since Floeval needs to generate responses for each prompt variant:

prompt_config.yaml

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
evaluation_config:
  metrics:
    - ragas:answer_relevancy
  prompts_file: "prompts.yaml"
dataset_generation_config:
  generator_model: gpt-4o-mini
```

Run the evaluation with your prompt config and dataset. Floeval reads the prompts file, generates a response for each (sample, prompt_id) pair, then scores all of them:

```sh
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json
```

Pass the prompts_file path to the Evaluation constructor. Results contain one row per (sample, prompt_id) pair, so you can compare scores across prompts:

```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {
        "user_input": "Summarize this support ticket.",
        "prompt_ids": ["prompt1", "prompt2"],
    },
], partial_dataset=True)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()
for row in results.sample_results:
    print(row.get("prompt_id"), row.get("metrics", {}))
```
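To decide which prompt won, group the result rows by `prompt_id` and average each metric. A sketch over mocked rows in the shape described above (the exact row schema may differ in your version):

```python
from collections import defaultdict

# Mocked (sample, prompt_id) result rows; real rows come from
# results.sample_results.
rows = [
    {"prompt_id": "prompt1", "metrics": {"answer_relevancy": 0.91}},
    {"prompt_id": "prompt2", "metrics": {"answer_relevancy": 0.84}},
    {"prompt_id": "prompt1", "metrics": {"answer_relevancy": 0.87}},
    {"prompt_id": "prompt2", "metrics": {"answer_relevancy": 0.90}},
]

# Collect per-prompt scores, then average.
scores = defaultdict(list)
for row in rows:
    scores[row["prompt_id"]].append(row["metrics"]["answer_relevancy"])

averages = {pid: round(sum(vals) / len(vals), 3) for pid, vals in scores.items()}
print(averages)  # {'prompt1': 0.89, 'prompt2': 0.87}
```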

If your prompts use retrieval context, add contexts to each sample in the dataset and include context-aware metrics. This lets you test whether different prompt instructions affect how well the model stays grounded in the retrieved documents:

```yaml
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness
  prompts_file: "prompts.yaml"
```

The same metrics from LLM and RAG evaluations apply:

| Without RAG | With RAG |
| --- | --- |
| ragas:answer_relevancy | ragas:answer_relevancy |
| deepeval:answer_relevancy | ragas:faithfulness |
| | ragas:context_precision |
| | ragas:context_recall |

See RAG Evaluations for the full metrics list.

Custom metrics (see LLM Evaluations) work with prompt evaluations. Your function receives the generated response for each (sample, prompt_id) pair. No extra configuration is needed.
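As an illustration of the idea, a custom metric can be an ordinary scoring function over one generated response. The function name, the row shape, and the direct call below are all hypothetical; the actual signature and registration mechanism are described in LLM Evaluations:

```python
# Hypothetical custom metric: rewards concise answers. Floeval would invoke
# something like this once per (sample, prompt_id) pair.
def max_two_sentences(sample: dict) -> float:
    """Return 1.0 if the generated response is at most two sentences, else 0.0."""
    response = sample["llm_response"]
    sentences = [s for s in response.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return 1.0 if len(sentences) <= 2 else 0.0

# Mocked row, called directly for demonstration.
row = {
    "user_input": "What is RAG?",
    "prompt_id": "prompt1",
    "llm_response": "RAG retrieves documents. It grounds answers in them.",
}
print(max_two_sentences(row))  # 1.0
```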