
LLM Evaluations

LLM evaluations let you validate and compare different LLMs on your dataset. You provide questions and (optionally) model responses; Floeval scores how well each model answers. Use the results to decide which LLM best suits your use case.


user_input is mandatory for every sample — it is the question or prompt you’re evaluating. For full datasets, you also need llm_response (the model’s answer). Wrap all samples in a "samples" array. Floeval supports both JSON and JSONL formats.
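In JSONL, each line is a standalone JSON object for one sample, with no surrounding "samples" array (this layout follows common JSONL convention and is an assumption here, not confirmed by Floeval's docs). A minimal sketch of converting between the two shapes using only the standard library:

```python
import json

# Hypothetical samples mirroring the JSON examples below.
samples = [
    {"user_input": "What is Python?", "llm_response": "Python is a programming language."},
    {"user_input": "What is RAG?", "llm_response": "RAG stands for Retrieval-Augmented Generation."},
]

# JSONL: one JSON object per line, no "samples" wrapper.
jsonl = "\n".join(json.dumps(s) for s in samples)

# Reading it back recovers the same samples.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == samples
```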

Full dataset — you already have responses


Use a full dataset when you have pre-generated model outputs that you want to score. Each sample must include user_input (required) and llm_response:

```json
{
  "samples": [
    {
      "user_input": "What is Python?",
      "llm_response": "Python is a programming language.",
      "ground_truth": "A programming language"
    },
    {
      "user_input": "What is RAG?",
      "llm_response": "RAG stands for Retrieval-Augmented Generation.",
      "contexts": ["RAG combines retrieval with generation."]
    }
  ]
}
```

ground_truth and contexts are optional. contexts is required only if you use the faithfulness metric.
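If you evaluate faithfulness, every sample needs a contexts array like the second sample above, and the metric must be listed in your config. A sketch of what that config entry might look like, assuming the metric ID follows the provider:metric naming shown later (ragas:faithfulness is an assumption, not confirmed by this page):

```yaml
evaluation_config:
  metrics:
    - ragas:answer_relevancy
    - ragas:faithfulness   # assumed metric ID; requires "contexts" on every sample
```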

Partial dataset — let Floeval generate responses


Use a partial dataset when you have questions but no answers yet. Omit llm_response and Floeval will call your LLM at runtime to generate responses, then score them automatically:

```json
{
  "samples": [
    { "user_input": "What is Python?" },
    { "user_input": "What is RAG?" }
  ]
}
```

To run partial evaluations, you must follow these steps:

| Step | What to do |
| --- | --- |
| 1. Dataset | Omit llm_response from every sample. Include only user_input (and optional fields like ground_truth). |
| 2. Config (CLI) | Add dataset_generation_config with generator_model, the model Floeval will use to generate responses. |
| 3. From code | Pass partial_dataset=True to DatasetLoader.from_samples() and dataset_generator_model to Evaluation(). |
| 4. LLM access | Ensure llm_config is valid. Floeval needs it both for generation and for metrics that call the model. |

If any of these are missing, the evaluation will fail or behave unexpectedly.
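Before kicking off a partial run, a quick standalone shape check can catch the most common mistakes. This helper is not part of Floeval's API, just a minimal sketch using the standard library:

```python
import json

def check_partial_dataset(raw: str) -> list[str]:
    """Return a list of problems found in a partial dataset; empty means OK."""
    problems = []
    data = json.loads(raw)
    for i, sample in enumerate(data.get("samples", [])):
        if "user_input" not in sample:
            problems.append(f"sample {i}: missing required user_input")
        if "llm_response" in sample:
            problems.append(f"sample {i}: has llm_response, so the dataset is not partial")
    return problems

partial = '{"samples": [{"user_input": "What is Python?"}, {"user_input": "What is RAG?"}]}'
print(check_partial_dataset(partial))  # []
```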


The config file tells Floeval which LLM to use and which metrics to run. Create a config.yaml with your LLM credentials and metric selection:

```yaml
llm_config:
  base_url: "https://api.openai.com/v1"
  api_key: "your-api-key"
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small

evaluation_config:
  metrics:
    - ragas:answer_relevancy
```

For partial datasets, add dataset_generation_config:

```yaml
dataset_generation_config:
  generator_model: gpt-4o-mini
```

Point the CLI at your config and dataset files. Floeval auto-detects whether your dataset is full or partial and adjusts accordingly:

```sh
floeval evaluate -c config.yaml -d dataset.json -o results.json
```

Partial datasets give you two options:

  1. One step (generate + evaluate): Run floeval evaluate directly on a partial dataset. Floeval generates responses at runtime and evaluates them in a single run. Best when you want a quick evaluation and don’t need to save the generated responses.

  2. Two steps (generate, then evaluate): First generate a full dataset from your partial one, then evaluate it. Useful when you want to audit generated responses, reuse the same dataset for multiple metric configurations, or version-control the generated data.

```sh
# Step 1: Generate a full dataset from the partial one
floeval generate -c config.yaml -d partial_dataset.json -o complete_dataset.json

# Step 2: Evaluate the generated dataset
floeval evaluate -c config.yaml -d complete_dataset.json -o results.json
```

To integrate evaluation into your application, use the Evaluation class. Build your LLM config, load or construct a dataset, select metrics, and call run():

```python
import os

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key"),
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples([
    {"user_input": "What is Python?", "llm_response": "Python is a programming language."},
    {"user_input": "What is RAG?", "llm_response": "RAG stands for Retrieval-Augmented Generation."},
], partial_dataset=False)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)
```

For partial datasets, pass partial_dataset=True to DatasetLoader.from_samples() and dataset_generator_model="gpt-4o-mini" to Evaluation().


| Metric ID | Provider | What it measures |
| --- | --- | --- |
| ragas:answer_relevancy | RAGAS | How relevant the answer is to the question |
| deepeval:answer_relevancy | DeepEval | Answer relevance (DeepEval implementation) |

You can define your own scoring functions using the @custom_metric decorator. The function receives the response (mapped from llm_response) and returns a float score between 0 and 1:

```python
from floeval.api.metrics.custom import custom_metric

@custom_metric(threshold=0.5)
def response_length(response: str) -> float:
    # Score scales linearly with length, capped at 1.0 for 100+ characters.
    return min(len(response) / 100.0, 1.0)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["ragas:answer_relevancy", response_length],
)
```

With partial datasets: Custom metrics work the same way. Floeval generates the response first, then passes it to your custom metric. No extra configuration is needed — your function receives the generated llm_response as the response argument.
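Because the scoring body is plain Python, you can sanity-check it without running a full evaluation. Here the decorator is stripped and the same length-based rule from the hypothetical response_length metric above is exercised directly:

```python
def response_length_score(response: str) -> float:
    # Same rule as the response_length custom metric above:
    # linear in length, capped at 1.0 once the response reaches 100 characters.
    return min(len(response) / 100.0, 1.0)

print(response_length_score("Python is a programming language."))  # 0.33 (33 characters)
print(response_length_score("x" * 200))                            # 1.0 (capped)
```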