# Prompt Evaluations
Prompt evaluations let you try different instructions (system prompts) on the same questions to see which one makes your LLM answer better. For example: “Answer in one sentence” vs “Answer with a short explanation.” Floeval runs each question through your LLM once with each instruction, gets a response for each, scores them, and shows you which instruction performed better. You can test as many prompts as you want on as many questions as you want.
## How It Works

- You create a prompts file (YAML) — each entry is a different instruction (e.g. prompt1 = “Be concise”, prompt2 = “Be detailed”)
- Your dataset has questions and a list of which prompts to try for each question
- Floeval runs each question with each prompt — so 2 questions × 2 prompts = 4 LLM calls and 4 responses
- Floeval scores each response so you can compare which prompt works best
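The fan-out in the steps above is just a cross product of questions and prompts. A minimal sketch (illustrative only — Floeval performs this expansion internally):

```python
from itertools import product

questions = ["What is the capital of France?", "What is RAG in machine learning?"]
prompt_ids = ["prompt1", "prompt2"]

# Each (question, prompt) pair becomes one LLM call and one scored response.
runs = list(product(questions, prompt_ids))
print(len(runs))  # 2 questions x 2 prompts = 4 runs
```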
Prompt evaluations are always partial — Floeval generates responses at runtime. There is no pre-generated llm_response in the dataset.
## Requirements for prompt evaluations

| Step | What to do |
|---|---|
| 1. Prompts file | Create a YAML file with named prompts. Each prompt needs a template field. |
| 2. Dataset | Each sample needs user_input and prompt_ids (array of prompt IDs to test). Do not include llm_response. |
| 3. Config | Add prompts_file under evaluation_config and dataset_generation_config with generator_model. |
| 4. From code | Pass partial_dataset=True, dataset_generator_model, and prompts_file to Evaluation(). |
## Step 1: Create Your Prompts File

Define each prompt variant in a YAML file. Every prompt needs a template field containing the system instruction that Floeval will use when generating responses:
```yaml
prompts:
  prompt1:
    template: "Answer the question directly and concisely."
  prompt2:
    template: "Answer the question with a brief explanation of your reasoning."
```

## Step 2: Prepare Your Dataset
Each sample has a question (user_input) and a list of prompts to test (prompt_ids). Floeval will run the question through your LLM once for each prompt in the list. For example, if a question has ["prompt1", "prompt2"], Floeval gets 2 responses (one per prompt) and scores both:
```json
{
  "samples": [
    {
      "user_input": "What is the capital of France?",
      "prompt_ids": ["prompt1", "prompt2"]
    },
    {
      "user_input": "What is RAG in machine learning?",
      "prompt_ids": ["prompt1", "prompt2"]
    }
  ]
}
```

The prompt_ids values (prompt1, prompt2) must match the keys in your prompts file. This produces 4 evaluations (2 questions × 2 prompts).
## Step 3: Create Your Config

The config links everything together: your LLM credentials, the metrics to run, and the path to your prompts file. Add prompts_file under evaluation_config and include dataset_generation_config since Floeval needs to generate responses for each prompt variant:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
  prompts_file: "prompts.yaml"

dataset_generation_config:
  generator_model: gpt-4o-mini
```

## Step 4: Run
### From the command line

Run the evaluation with your prompt config and dataset. Floeval reads the prompts file, generates a response for each (sample, prompt_id) pair, then scores all of them:
```shell
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json
```

### From code
Pass the prompts_file path to the Evaluation constructor. Results contain one row per (sample, prompt_id) pair, so you can compare scores across prompts:
```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {
        "user_input": "Summarize this support ticket.",
        "prompt_ids": ["prompt1", "prompt2"],
    },
], partial_dataset=True)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()
for row in results.sample_results:
    print(row.get("prompt_id"), row.get("metrics", {}))
```

## Prompt Evaluation with RAG
If your prompts use retrieval context, add contexts to each sample in the dataset and include context-aware metrics. This lets you test whether different prompt instructions affect how well the model stays grounded in the retrieved documents:
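For example, a sample with retrieval context might look like this (a sketch following the dataset format from Step 2, with contexts holding the retrieved passages):

```json
{
  "samples": [
    {
      "user_input": "What is RAG in machine learning?",
      "contexts": [
        "RAG (retrieval-augmented generation) pairs a retriever that fetches relevant documents with a generator that answers using them."
      ],
      "prompt_ids": ["prompt1", "prompt2"]
    }
  ]
}
```

With contexts in place, add a grounding metric such as ragas:faithfulness to your metrics list: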
```yaml
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness
  prompts_file: "prompts.yaml"
```

## Available Metrics
The same metrics from LLM and RAG evaluations apply:
| Without RAG | With RAG |
|---|---|
| ragas:answer_relevancy | ragas:answer_relevancy |
| deepeval:answer_relevancy | ragas:faithfulness |
| | ragas:context_precision |
| | ragas:context_recall |
See RAG Evaluations for the full metrics list.
### Custom metrics

Custom metrics (see LLM Evaluations) work with prompt evaluations. Your function receives the generated response for each (sample, prompt_id) pair. No extra configuration is needed.
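As a sketch of the idea only — the exact signature and registration come from your custom-metric setup (see LLM Evaluations), and everything named here is illustrative — a custom metric is just a function that scores the generated response:

```python
def response_length_score(response: str) -> float:
    """Illustrative custom metric: reward concise answers.

    The real signature depends on how you register custom metrics
    (see LLM Evaluations); this only shows the idea of scoring the
    generated response for each (sample, prompt_id) pair.
    """
    words = len(response.split())
    if words == 0:
        return 0.0
    # 1.0 for answers of 20 words or fewer, decaying toward 0 beyond that.
    return min(1.0, 20 / words)
```

Because Floeval generates one response per prompt variant, a metric like this lets you directly compare, say, a "be concise" prompt against a "be detailed" prompt on the same question.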
## Next Steps

- LLM Evaluations — evaluate raw model answers
- RAG Evaluations — combine prompts with context grounding
- Agent Evaluations — evaluate tool-using agents