Evaluations

Score workflow outputs in code with ready-made RAG metrics, the LLMEvaluator for custom judged metrics, and the PythonEvaluator for programmatic checks.

The dynamiq.evaluations package scores model outputs directly in Python — no platform run required. It has three layers: ready-made metric evaluators (faithfulness, answer correctness, BLEU/ROUGE, and more), LLMEvaluator for custom LLM-as-judge metrics, and PythonEvaluator for custom programmatic metrics. The same metrics also power Evaluations on the platform.

Ready-made metrics

All metric classes live in dynamiq.evaluations.metrics. LLM-judged metrics take an llm node (any LLM from dynamiq.nodes.llms); string metrics need no LLM.

Evaluator	Needs LLM	`run(...)` inputs	Returns
`FaithfulnessEvaluator`	yes	`questions`, `answers`, `contexts`	results with `score` and `reasoning`
`ContextRecallEvaluator`	yes	`questions`, `contexts`, `answers`	results with per-question scores
`ContextPrecisionEvaluator`	yes	`questions`, `answers`, `contexts_list`	results with per-question scores
`AnswerCorrectnessEvaluator`	yes	`questions`, `answers`, `ground_truth_answers`	results with precision/recall-based scores
`FactualCorrectnessEvaluator`	yes	`answers`, `contexts`	claim-level precision/recall scores
`BleuScoreEvaluator`	no	`ground_truth_answers`, `answers`	`list[float]`
`RougeScoreEvaluator`	no	`ground_truth_answers`, `answers`	`list[float]`
`ExactMatchEvaluator`	no	`ground_truth_answers`, `answers`	`list[float]`
`StringPresenceEvaluator`	no	`ground_truth_answers`, `answers`	`list[float]`
`StringSimilarityEvaluator`	no	`ground_truth_answers`, `answers` (with a `DistanceMeasure`)	`list[float]`

A complete faithfulness check — does the answer stick to the retrieved context?

from dynamiq.evaluations.metrics import FaithfulnessEvaluator
from dynamiq.nodes.llms import OpenAI

llm = OpenAI(model="gpt-4o-mini")

questions = ["Who was Albert Einstein?"]
answers = [
    "He was a German-born theoretical physicist, widely acknowledged as one of the "
    "most influential physicists of all time, best known for the theory of relativity."
]
contexts = [
    "Albert Einstein was a German-born theoretical physicist. "
    "He developed the theory of relativity."
]

evaluator = FaithfulnessEvaluator(llm=llm)
output = evaluator.run(questions=questions, answers=answers, contexts=contexts)

for result in output.results:
    print(result.score)
    print(result.reasoning)

String metrics return plain score lists:

from dynamiq.evaluations.metrics import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()
scores = evaluator.run(
    ground_truth_answers=["Paris", "London"],
    answers=["Paris", "Berlin"],
)
print(scores)  # [1.0, 0.0]
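
Because string metrics return a plain `list[float]`, summarizing them needs only the standard library. The following is a minimal sketch, assuming the `scores` list from the example above (hard-coded here so the snippet runs without dynamiq installed). Treating anything below 1.0 as a failed row is an illustrative choice, not a library convention.

```python
from statistics import mean

# Scores as returned by the ExactMatchEvaluator example above
# (hard-coded for illustration).
scores = [1.0, 0.0]

accuracy = mean(scores)  # fraction of rows that matched exactly
failed_rows = [i for i, s in enumerate(scores) if s < 1.0]  # rows worth inspecting

print(f"accuracy={accuracy:.2f}, failed rows={failed_rows}")
# accuracy=0.50, failed rows=[1]
```

The same pattern should apply to any evaluator in the table whose return type is `list[float]`.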

Runnable scripts for every metric are in examples/components/evaluations/metrics.

Custom LLM-judged metrics with LLMEvaluator

LLMEvaluator builds a judging prompt from your instructions, typed inputs/outputs, and few-shot examples, then runs it over batches. The output is {"results": [...]} with one dict per input row, shaped by your outputs spec:

from dynamiq.evaluations import LLMEvaluator
from dynamiq.nodes.llms import OpenAI

llm = OpenAI(model="gpt-4o-mini")

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[{"name": "answers", "type": list[str]}],
    outputs=[
        {"name": "reasoning", "type": str},
        {"name": "score", "type": int},
    ],
    examples=[
        {
            "inputs": {"answers": "Damn, this is straight outta hell!!!"},
            "outputs": {"reasoning": "The answer contains inappropriate language.", "score": 1},
        },
        {
            "inputs": {"answers": "Football is the most popular sport."},
            "outputs": {"reasoning": "The answer is appropriate for children.", "score": 0},
        },
    ],
    llm=llm,
)

results = evaluator.run(
    answers=[
        "Football is the most popular sport with around 4 billion followers worldwide",
        "Python language was created by Guido van Rossum.",
    ]
)
print(results)  # {'results': [{'reasoning': ..., 'score': 0}, {'reasoning': ..., 'score': 0}]}

Add a second input (for example ground_truth) to build reference-based judges — every declared input becomes a keyword argument to run, and all input lists are evaluated row by row.
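
As a rough illustration of the two-input case, the sketch below writes out the specs for a reference-based judge and shows how the input lists pair up row by row. The `ground_truth` input name, the instruction text, the example strings, and the `check_rows` helper are all hypothetical. The snippet deliberately stops short of constructing an `LLMEvaluator`, so it runs without an LLM.

```python
# Hypothetical specs for a reference-based judge. The "ground_truth" name and
# all texts are illustrative and not taken from the library.
instructions = "Does the answer agree with the ground truth?"
inputs = [
    {"name": "answers", "type": list[str]},
    {"name": "ground_truth", "type": list[str]},
]
outputs = [
    {"name": "reasoning", "type": str},
    {"name": "score", "type": int},
]
examples = [
    {
        "inputs": {"answers": "Paris is the capital of France.", "ground_truth": "Paris"},
        "outputs": {"reasoning": "The answer names the same city.", "score": 1},
    },
]

# Keyword arguments for run(): one list per declared input.
run_kwargs = {
    "answers": ["Berlin is the capital of Germany.", "Madrid is the capital of Italy."],
    "ground_truth": ["Berlin", "Rome"],
}


def check_rows(specs, kwargs):
    """Check that every declared input is present and return the rows as dicts."""
    names = [spec["name"] for spec in specs]
    missing = [name for name in names if name not in kwargs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    if len({len(kwargs[name]) for name in names}) != 1:
        raise ValueError("all input lists must have the same length")
    return [dict(zip(names, values)) for values in zip(*(kwargs[name] for name in names))]


rows = check_rows(inputs, run_kwargs)
print(rows[1])
# {'answers': 'Madrid is the capital of Italy.', 'ground_truth': 'Rome'}
```

With an LLM configured, these values would be passed as `LLMEvaluator(instructions=instructions, inputs=inputs, outputs=outputs, examples=examples, llm=llm)` followed by `evaluator.run(**run_kwargs)`. The length check is a local precaution in this sketch; whether the library performs the same validation is not verified here.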

Custom programmatic metrics with PythonEvaluator

PythonEvaluator executes a user-defined evaluate function inside the same restricted sandbox used by the Python node. Each input dict must supply the function's required parameters; the function returns the score:

from dynamiq.evaluations import PythonEvaluator

user_code = """
def evaluate(answer, expected):
    return 1.0 if answer == expected else 0.0
"""

evaluator = PythonEvaluator(code=user_code)
scores = evaluator.run(
    input_data_list=[
        {"answer": "Paris", "expected": "Paris"},
        {"answer": "Madrid", "expected": "Barcelona"},
    ]
)
print(scores)  # [1.0, 0.0]

run_single(input_data={...}) scores one row. The code is compiled with restricted globals, must define a callable named evaluate, and may use default parameter values for optional inputs.
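
To illustrate optional inputs, here is a sketch of an `evaluate` function with a default parameter. The `case_sensitive` parameter is hypothetical. The dry run below uses plain `exec()` rather than the restricted sandbox, so it only checks the function's logic locally; whether the sandbox permits every construct used here is an assumption.

```python
user_code = """
def evaluate(answer, expected, case_sensitive=True):
    if not case_sensitive:
        answer, expected = answer.lower(), expected.lower()
    return 1.0 if answer == expected else 0.0
"""

# Local dry run with plain exec(), NOT the restricted sandbox: this only checks
# the function's logic before handing user_code to PythonEvaluator.
namespace = {}
exec(user_code, namespace)
evaluate = namespace["evaluate"]

rows = [
    {"answer": "Paris", "expected": "paris"},  # default applies: case-sensitive
    {"answer": "Paris", "expected": "paris", "case_sensitive": False},  # override
]
scores = [evaluate(**row) for row in rows]
print(scores)  # [0.0, 1.0]
```

The same `user_code` string would then go to `PythonEvaluator(code=user_code)`, with `rows` passed as `input_data_list`.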

Evaluating workflow outputs

A common pattern is to run a workflow, pull the answer and retrieved documents out of the result, and score them. The snippet below follows the workflow evaluation example:

from dynamiq.evaluations.metrics import ContextRecallEvaluator, FaithfulnessEvaluator
from dynamiq.nodes.llms import OpenAI

question = "How to build an advanced RAG pipeline?"
wf_result = retrieval_wf.run(input_data={"query": question})  # any RAG workflow (see RAG Pipeline)
# The output is keyed by node ID; replace these IDs with the ones in your workflow.
answer = wf_result.output["openai-1"]["output"]["answer"]
documents = wf_result.output["document-retriever-node-1"]["output"]["documents"]
context = " ".join(doc["content"] for doc in documents)  # one context string per question

llm = OpenAI(model="gpt-4o-mini")
recall = ContextRecallEvaluator(llm=llm).run(questions=[question], answers=[answer], contexts=[context])
faithfulness = FaithfulnessEvaluator(llm=llm).run(questions=[question], answers=[answer], contexts=[context])

print({
    "context_recall": recall.results[0].score,
    "faithfulness": faithfulness.results[0].score,
})
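
Scoring more than one question means building parallel lists, one entry per row. The sketch below shows one way to do that, assuming per-question records shaped like the values extracted above. The records, their texts, and the `to_eval_inputs` helper are hypothetical, and the workflow calls are replaced by hard-coded data so the snippet runs on its own.

```python
# Hypothetical per-question records, shaped like the values extracted above:
# an answer string plus retrieved documents with a "content" field.
records = [
    {
        "question": "How to build an advanced RAG pipeline?",
        "answer": "Combine retrieval, reranking, and generation.",
        "documents": [
            {"content": "Retrieval finds candidates."},
            {"content": "Reranking orders them."},
        ],
    },
    {
        "question": "What is faithfulness?",
        "answer": "Whether the answer is supported by the context.",
        "documents": [{"content": "Faithfulness checks support in context."}],
    },
]


def to_eval_inputs(records):
    """Turn records into the parallel lists the evaluators take as keyword arguments."""
    return {
        "questions": [r["question"] for r in records],
        "answers": [r["answer"] for r in records],
        # Join each row's documents into a single context string.
        "contexts": [" ".join(d["content"] for d in r["documents"]) for r in records],
    }


eval_inputs = to_eval_inputs(records)
print(eval_inputs["contexts"][0])
# Retrieval finds candidates. Reranking orders them.
```

The resulting dict could then be unpacked into an evaluator call such as `FaithfulnessEvaluator(llm=llm).run(**eval_inputs)`, with one entry in `results` expected per row.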

To run evaluations over datasets with versioned runs and dashboards, use Evaluations on the platform, which covers datasets, metrics, and evaluation runs. The platform metric types map to these same SDK classes.

Next steps

Platform Evaluations

RAG Pipeline

Running Workflows & Results
