# RAG Evaluations
RAG (Retrieval-Augmented Generation) evaluations validate systems that retrieve relevant documents and use them to generate answers. You provide questions and retrieved contexts; Floeval scores how well the retrieval and generation work together. Use this when your pipeline fetches documents from a knowledge base or vector store and passes them to an LLM to produce an answer.
## Step 1: Prepare Your Dataset

RAG datasets extend the LLM format with a `contexts` field — a list of retrieved document passages for each question. Floeval uses these contexts to evaluate whether the answer is grounded in the retrieved information and whether the right documents were retrieved.
### Full dataset — you already have responses and contexts

Use a full dataset when you have both model responses and the documents that were retrieved for each question:
```json
{
  "samples": [
    {
      "user_input": "How does photosynthesis work?",
      "llm_response": "Photosynthesis converts sunlight into energy using chlorophyll in plants.",
      "contexts": [
        "Plants use chlorophyll to capture light.",
        "Converts CO2 and water into glucose and oxygen."
      ],
      "ground_truth": "Converts light into chemical energy"
    },
    {
      "user_input": "What is machine learning?",
      "llm_response": "Machine learning is a branch of AI where systems learn from data.",
      "contexts": [
        "ML uses algorithms to find patterns in data.",
        "Common types include supervised and unsupervised learning."
      ]
    }
  ]
}
```

### Partial dataset — provide contexts, let Floeval generate responses
Use a partial dataset when you have questions and retrieved contexts but no model responses yet. Floeval calls your LLM with the question and contexts, generates the response, then scores it:
```json
{
  "samples": [
    {
      "user_input": "What is RAG?",
      "contexts": ["RAG combines document retrieval with language generation."]
    }
  ]
}
```

### Requirements for partial evaluations
To run partial RAG evaluations, you must follow these steps:
| Step | What to do |
|---|---|
| 1. Dataset | Omit `llm_response` from every sample. Include `user_input` and `contexts` (required for RAG — Floeval needs them to generate the answer). Optionally add `ground_truth` for context metrics. |
| 2. Config (CLI) | Add `dataset_generation_config` with `generator_model` — the model Floeval will use to generate responses from question + contexts. |
| 3. From code | Pass `partial_dataset=True` to `DatasetLoader.from_samples()` and `dataset_generator_model` to `Evaluation()`. |
| 4. LLM access | Ensure `llm_config` is valid — Floeval needs it for generation and for scoring. |
If any of these are missing, the evaluation will fail or behave unexpectedly.
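Because a missing field only surfaces as a failure at run time, it can help to sanity-check samples before submitting them. The sketch below is a hypothetical pre-flight helper in plain Python — `validate_partial_rag_sample` is not part of Floeval's API, just an illustration of the rules in the table above:

```python
# Hypothetical pre-flight check for partial RAG samples (not a Floeval API).
def validate_partial_rag_sample(sample: dict) -> list[str]:
    """Return a list of problems; an empty list means the sample is usable."""
    problems = []
    if "llm_response" in sample:
        problems.append("llm_response must be omitted for partial datasets")
    if not sample.get("user_input"):
        problems.append("user_input is required")
    if not sample.get("contexts"):
        problems.append("contexts are required for RAG generation")
    return problems


sample = {
    "user_input": "What is RAG?",
    "contexts": ["RAG combines document retrieval with language generation."],
}
print(validate_partial_rag_sample(sample))  # → []
```

Running a check like this over every sample before the evaluation starts turns a confusing mid-run failure into an immediate, specific error message.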
## Step 2: Create Your Config

The config specifies your LLM credentials and which RAG metrics to run. Start with `answer_relevancy` and `faithfulness` for a complete picture of answer quality:
```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness
```

For partial datasets, add:
```yaml
dataset_generation_config:
  generator_model: gpt-4o-mini
```

## Step 3: Run
### From the command line

Run the evaluation by pointing the CLI at your config and dataset. The CLI auto-detects full vs partial datasets:
```sh
floeval evaluate -c config.yaml -d rag_dataset.json -o results.json
```

### From code
Use the `Evaluation` class to run RAG evaluations from code. The setup is the same as LLM evaluations, but your dataset includes `contexts` and you add context-aware metrics like `faithfulness`:
```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "What is RAG?",
            "llm_response": "RAG stands for Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        },
        {
            "user_input": "How does photosynthesis work?",
            "llm_response": "Photosynthesis converts sunlight into energy.",
            "contexts": ["Plants use chlorophyll.", "Converts CO2 and water into glucose."],
        },
    ],
    partial_dataset=False,
)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)
```

## Available Metrics
### RAGAS

| Metric ID | What it measures | Key fields |
|---|---|---|
| `ragas:answer_relevancy` | How relevant the answer is to the question | `user_input`, `llm_response` |
| `ragas:faithfulness` | Whether the answer is grounded in the contexts | `llm_response`, `contexts` |
| `ragas:context_precision` | Whether relevant contexts are ranked first | `contexts`, `ground_truth` |
| `ragas:context_recall` | How much reference info is covered by contexts | `contexts`, `ground_truth` |
| `ragas:context_entity_recall` | Entity coverage in contexts vs reference | `contexts`, `ground_truth` |
| `ragas:noise_sensitivity` | Sensitivity to noisy or irrelevant context | `contexts`, `llm_response` |
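To build intuition for what a grounding metric measures: the real `ragas:faithfulness` metric uses an LLM to extract and verify claims against the contexts, but the underlying idea can be sketched with a crude word-overlap heuristic. This toy function is an illustration only, not RAGAS's implementation:

```python
# Toy illustration of the faithfulness idea: what fraction of answer
# sentences share substantial vocabulary with the retrieved contexts?
# (The real metric verifies claims with an LLM; this is only a sketch.)
def toy_grounding_score(answer: str, contexts: list[str]) -> float:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    context_words = {w.lower() for c in contexts for w in c.split()}
    grounded = sum(
        1
        for s in sentences
        if len({w.lower() for w in s.split()} & context_words) >= 2
    )
    return grounded / len(sentences)


score = toy_grounding_score(
    "Plants use chlorophyll to capture light.",
    ["Plants use chlorophyll to capture light energy."],
)
print(score)  # → 1.0 — every answer sentence overlaps a context
```

An answer sentence with no vocabulary in common with any context would pull the score down, which mirrors how faithfulness penalizes unsupported claims.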
### DeepEval

| Metric ID | What it measures | Key fields |
|---|---|---|
| `deepeval:answer_relevancy` | Answer relevance | `user_input`, `llm_response` |
| `deepeval:faithfulness` | Answer grounded in contexts | `llm_response`, `contexts` |
| `deepeval:contextual_precision` | Context precision | `contexts`, `ground_truth` |
| `deepeval:contextual_recall` | Context recall | `contexts`, `ground_truth` |
| `deepeval:contextual_relevancy` | Overall context relevancy | `contexts` |
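The precision metrics in both tables reward retrievers that rank relevant contexts first. The ranking intuition can be sketched in isolation: given binary relevance labels in retrieval order (which the real metrics derive from `ground_truth` via an LLM), average the precision at each position where a relevant context appears. This is a hypothetical illustration, not either library's formula:

```python
# Toy sketch of the ranking idea behind context precision: relevant
# contexts should appear before irrelevant ones. Given binary relevance
# labels in retrieval order, average precision@k over the relevant hits.
def toy_context_precision(relevance: list[bool]) -> float:
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# A relevant context at rank 1 and rank 3, with an irrelevant one between:
print(toy_context_precision([True, False, True]))
```

Swapping the irrelevant context to the end (`[True, True, False]`) raises the score, which is exactly the behavior the metric is designed to reward.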
### Mixing providers

You can route individual metrics to different scoring backends in the same evaluation. This is useful when you want RAGAS scoring for relevancy and DeepEval scoring for faithfulness:
```python
evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", "deepeval:faithfulness"],
)
```

### Custom metrics
You can add custom metrics (see LLM Evaluations) to RAG evaluations. With partial datasets, your custom metric receives the generated response after Floeval produces it from the question and contexts. No extra configuration is needed.
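Whatever interface Floeval expects (see LLM Evaluations for the actual registration API), a custom metric is conceptually just a function from a sample to a score. A hypothetical exact-match metric makes the shape concrete — the function name and signature here are illustrative, not Floeval's:

```python
# Hypothetical custom metric: conceptually, a function from a sample dict
# to a score in [0, 1]. Adapt to Floeval's actual custom-metric interface.
def exact_match(sample: dict) -> float:
    """Score 1.0 when the response matches the reference, ignoring case."""
    if "ground_truth" not in sample:
        return 0.0
    return float(
        sample["llm_response"].strip().lower()
        == sample["ground_truth"].strip().lower()
    )


print(exact_match({"llm_response": "Paris", "ground_truth": "paris"}))  # → 1.0
```

With a partial dataset, `llm_response` would be the text Floeval generated from the question and contexts, so a metric like this runs unchanged on full and partial datasets alike.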
## Next Steps

- LLM Evaluations — evaluate answers without a retrieval step
- Prompt Evaluations — compare prompts with RAG metrics
- Agent Evaluations — evaluate tool-using agents