
Prompt Evaluations

Prompt evaluations let you try different instructions (system prompts) on the same questions to see which one makes your LLM answer better. For example: “Answer in one sentence” vs “Answer with a short explanation.” Floeval runs each question through your LLM once with each instruction, gets a response for each, scores them, and shows you which instruction performed better. You can test as many prompts as you want on as many questions as you want.


  1. You create a prompts file (YAML) — each entry is a different instruction (e.g. prompt1 = “Be concise”, prompt2 = “Be detailed”)
  2. Your dataset has questions and a list of which prompts to try for each question
  3. Floeval runs each question with each prompt — so 2 questions × 2 prompts = 4 LLM calls and 4 responses
  4. Floeval scores each response so you can compare which prompt works best
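The fan-out in step 3 is a plain cross product of samples and their prompt lists. A minimal sketch of that expansion (the names here are illustrative, not Floeval internals):

```python
# Hypothetical sketch of the fan-out Floeval performs at runtime:
# one LLM call per (sample, prompt_id) pair.
samples = [
    {"user_input": "What is the capital of France?", "prompt_ids": ["prompt1", "prompt2"]},
    {"user_input": "What is RAG in machine learning?", "prompt_ids": ["prompt1", "prompt2"]},
]

calls = [
    (sample["user_input"], prompt_id)
    for sample in samples
    for prompt_id in sample["prompt_ids"]
]

print(len(calls))  # 2 questions x 2 prompts = 4 LLM calls
```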

Prompt evaluations are always partial — Floeval generates responses at runtime. There is no pre-generated llm_response in the dataset.

| Step | What to do |
| --- | --- |
| 1. Prompts file | Create a YAML file with named prompts. Each prompt needs a `template` field. |
| 2. Dataset | Each sample needs `user_input` and `prompt_ids` (array of prompt IDs to test). Do not include `llm_response`. |
| 3. Config | Add `prompts_file` under `evaluation_config` and a `dataset_generation_config` with `generator_model`. |
| 4. From code | Pass `partial_dataset=True`, `dataset_generator_model`, and `prompts_file` to `Evaluation()`. |

Define each prompt variant in a YAML file. Every prompt needs a template field containing the system instruction that Floeval will use when generating responses:

prompts.yaml

```yaml
prompts:
  prompt1:
    template: "Answer the question directly and concisely."
  prompt2:
    template: "Answer the question with a brief explanation of your reasoning."
```

Each sample has a question (user_input) and a list of prompts to test (prompt_ids). Floeval will run the question through your LLM once for each prompt in the list. For example, if a question has ["prompt1", "prompt2"], Floeval gets 2 responses (one per prompt) and scores both:

```json
{
  "samples": [
    {
      "user_input": "What is the capital of France?",
      "prompt_ids": ["prompt1", "prompt2"]
    },
    {
      "user_input": "What is RAG in machine learning?",
      "prompt_ids": ["prompt1", "prompt2"]
    }
  ]
}
```

The prompt_ids values (prompt1, prompt2) must match the keys in your prompts file. This produces 4 evaluations (2 questions × 2 prompts).
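A quick way to catch mismatches before running is to check every `prompt_ids` entry against the keys in your prompts file. A standalone sketch (not part of Floeval; the prompts mapping is inlined as a plain dict to keep it dependency-free):

```python
import json

# The keys under `prompts:` in prompts.yaml, loaded here as a plain dict.
prompt_templates = {
    "prompt1": "Answer the question directly and concisely.",
    "prompt2": "Answer the question with a brief explanation of your reasoning.",
}

dataset = json.loads("""
{
  "samples": [
    {"user_input": "What is the capital of France?", "prompt_ids": ["prompt1", "prompt2"]},
    {"user_input": "What is RAG in machine learning?", "prompt_ids": ["prompt1", "prompt2"]}
  ]
}
""")

# Every prompt ID referenced by the dataset must exist in the prompts file.
unknown = {
    pid
    for sample in dataset["samples"]
    for pid in sample["prompt_ids"]
    if pid not in prompt_templates
}
assert not unknown, f"prompt_ids not defined in prompts file: {unknown}"

total = sum(len(s["prompt_ids"]) for s in dataset["samples"])
print(total)  # 4 evaluations
```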


The config links everything together: your LLM credentials, the metrics to run, and the path to your prompts file. Add prompts_file under evaluation_config and include dataset_generation_config since Floeval needs to generate responses for each prompt variant:

prompt_config.yaml

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
evaluation_config:
  metrics:
    - ragas:answer_relevancy
  prompts_file: "prompts.yaml"
dataset_generation_config:
  generator_model: gpt-4o-mini
```

Run the evaluation with your prompt config and dataset. Floeval reads the prompts file, generates a response for each (sample, prompt_id) pair, then scores all of them:

```sh
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json
```

Pass the prompts_file path to the Evaluation constructor. Results contain one row per (sample, prompt_id) pair, so you can compare scores across prompts:

```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {
        "user_input": "Summarize this support ticket.",
        "prompt_ids": ["prompt1", "prompt2"],
    },
], partial_dataset=True)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()
for row in results.sample_results:
    print(row.get("prompt_id"), row.get("metrics", {}))
```
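To decide which prompt won, group the result rows by `prompt_id` and average each metric. A sketch over mocked rows in the shape described above (the exact row schema may differ in your version):

```python
from collections import defaultdict

# Mocked (sample, prompt_id) result rows; real rows come from
# results.sample_results.
rows = [
    {"prompt_id": "prompt1", "metrics": {"answer_relevancy": 0.91}},
    {"prompt_id": "prompt2", "metrics": {"answer_relevancy": 0.84}},
    {"prompt_id": "prompt1", "metrics": {"answer_relevancy": 0.87}},
    {"prompt_id": "prompt2", "metrics": {"answer_relevancy": 0.90}},
]

# Collect per-prompt scores, then average.
scores = defaultdict(list)
for row in rows:
    scores[row["prompt_id"]].append(row["metrics"]["answer_relevancy"])

averages = {pid: round(sum(vals) / len(vals), 3) for pid, vals in scores.items()}
print(averages)  # {'prompt1': 0.89, 'prompt2': 0.87}
```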

If your prompts use retrieval context, add contexts to each sample in the dataset and include context-aware metrics. This lets you test whether different prompt instructions affect how well the model stays grounded in the retrieved documents:

```yaml
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness
  prompts_file: "prompts.yaml"
```

The same metrics from LLM and RAG evaluations apply:

| Without RAG | With RAG |
| --- | --- |
| ragas:answer_relevancy | ragas:answer_relevancy |
| deepeval:answer_relevancy | ragas:faithfulness |
| | ragas:context_precision |
| | ragas:context_recall |

See RAG Evaluations for the full metrics list.

Custom metrics (see LLM Evaluations) work with prompt evaluations. Your function receives the generated response for each (sample, prompt_id) pair. No extra configuration is needed.
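As an illustration of the idea, a custom metric can be an ordinary scoring function over one generated response. The function name, the row shape, and the direct call below are all hypothetical; the actual signature and registration mechanism are described in LLM Evaluations:

```python
# Hypothetical custom metric: rewards concise answers. Floeval would invoke
# something like this once per (sample, prompt_id) pair.
def max_two_sentences(sample: dict) -> float:
    """Return 1.0 if the generated response is at most two sentences, else 0.0."""
    response = sample["llm_response"]
    sentences = [s for s in response.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return 1.0 if len(sentences) <= 2 else 0.0

# Mocked row, called directly for demonstration.
row = {
    "user_input": "What is RAG?",
    "prompt_id": "prompt1",
    "llm_response": "RAG retrieves documents. It grounds answers in them.",
}
print(max_two_sentences(row))  # 1.0
```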