Evaluations
Score workflow outputs in code with ready-made RAG metrics, the LLMEvaluator for custom judged metrics, and the PythonEvaluator for programmatic checks.
The dynamiq.evaluations package scores model outputs directly in Python — no platform run required. It has three layers: ready-made metric evaluators (faithfulness, answer correctness, BLEU/ROUGE, and more), LLMEvaluator for custom LLM-as-judge metrics, and PythonEvaluator for custom programmatic metrics. The same metrics also power Evaluations on the platform.
Ready-made metrics
All metric classes live in dynamiq.evaluations.metrics. LLM-judged metrics take an llm node (any LLM from dynamiq.nodes.llms); string metrics need no LLM.
| Evaluator | Needs LLM | run(...) inputs | Returns |
|---|---|---|---|
| FaithfulnessEvaluator | yes | questions, answers, contexts | results with score and reasoning |
| ContextRecallEvaluator | yes | questions, contexts, answers | results with per-question scores |
| ContextPrecisionEvaluator | yes | questions, answers, contexts_list | results with per-question scores |
| AnswerCorrectnessEvaluator | yes | questions, answers, ground_truth_answers | results with precision/recall-based scores |
| FactualCorrectnessEvaluator | yes | answers, contexts | claim-level precision/recall scores |
| BleuScoreEvaluator | no | ground_truth_answers, answers | list[float] |
| RougeScoreEvaluator | no | ground_truth_answers, answers | list[float] |
| ExactMatchEvaluator | no | ground_truth_answers, answers | list[float] |
| StringPresenceEvaluator | no | ground_truth_answers, answers | list[float] |
| StringSimilarityEvaluator | no | ground_truth_answers, answers (with a DistanceMeasure) | list[float] |
A complete faithfulness check — does the answer stick to the retrieved context?
from dynamiq.evaluations.metrics import FaithfulnessEvaluator
from dynamiq.nodes.llms import OpenAI

llm = OpenAI(model="gpt-4o-mini")

questions = ["Who was Albert Einstein?"]
answers = [
    "He was a German-born theoretical physicist, widely acknowledged as one of the "
    "most influential physicists of all time, best known for the theory of relativity."
]
contexts = [
    "Albert Einstein was a German-born theoretical physicist. "
    "He developed the theory of relativity."
]

evaluator = FaithfulnessEvaluator(llm=llm)
output = evaluator.run(questions=questions, answers=answers, contexts=contexts)

for result in output.results:
    print(result.score)
    print(result.reasoning)

String metrics return plain score lists:
from dynamiq.evaluations.metrics import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()
scores = evaluator.run(
    ground_truth_answers=["Paris", "London"],
    answers=["Paris", "Berlin"],
)
print(scores)  # [1.0, 0.0]

Runnable scripts for every metric are in examples/components/evaluations/metrics.
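To make the return shape of the string metrics concrete, here is a minimal plain-Python sketch of what an exact-match and a string-presence score compute. The functions `exact_match` and `string_presence` are hypothetical stand-ins written for illustration, not dynamiq's implementation, and the substring semantics for string presence is an assumption.

```python
# Illustrative stand-ins only; these are NOT the dynamiq evaluators.

def _check_lengths(ground_truth_answers: list[str], answers: list[str]) -> None:
    """Both lists are scored pairwise, so they must line up."""
    if len(ground_truth_answers) != len(answers):
        raise ValueError("ground_truth_answers and answers must have the same length")


def exact_match(ground_truth_answers: list[str], answers: list[str]) -> list[float]:
    """1.0 where the answer equals the reference exactly, else 0.0."""
    _check_lengths(ground_truth_answers, answers)
    return [1.0 if ref == ans else 0.0 for ref, ans in zip(ground_truth_answers, answers)]


def string_presence(ground_truth_answers: list[str], answers: list[str]) -> list[float]:
    """Assumed semantics: 1.0 where the reference occurs as a substring of the answer."""
    _check_lengths(ground_truth_answers, answers)
    return [1.0 if ref in ans else 0.0 for ref, ans in zip(ground_truth_answers, answers)]


refs = ["Paris", "London"]
preds = ["The capital is Paris", "Berlin"]

exact_scores = exact_match(refs, preds)         # [0.0, 0.0]
presence_scores = string_presence(refs, preds)  # [1.0, 0.0]
print(exact_scores, presence_scores)
```

Either way, the result is one float per (reference, answer) pair, in input order, which is the same shape as the `list[float]` entries in the table above.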
Custom LLM-judged metrics with LLMEvaluator
LLMEvaluator builds a judging prompt from your instructions, typed inputs/outputs, and few-shot examples, then runs it over batches. The output is {"results": [...]} with one dict per input row, shaped by your outputs spec:
from dynamiq.evaluations import LLMEvaluator
from dynamiq.nodes.llms import OpenAI

llm = OpenAI(model="gpt-4o-mini")

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[{"name": "answers", "type": list[str]}],
    outputs=[
        {"name": "reasoning", "type": str},
        {"name": "score", "type": int},
    ],
    examples=[
        {
            "inputs": {"answers": "Damn, this is straight outta hell!!!"},
            "outputs": {"reasoning": "The answer contains inappropriate language.", "score": 1},
        },
        {
            "inputs": {"answers": "Football is the most popular sport."},
            "outputs": {"reasoning": "The answer is appropriate for children.", "score": 0},
        },
    ],
    llm=llm,
)

results = evaluator.run(
    answers=[
        "Football is the most popular sport with around 4 billion followers worldwide",
        "Python language was created by Guido van Rossum.",
    ]
)
print(results)  # {'results': [{'reasoning': ..., 'score': 0}, {'reasoning': ..., 'score': 0}]}

Add a second input (for example ground_truth) to build reference-based judges. Every declared input becomes a keyword argument to run, and all input lists are evaluated row by row.
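The row-by-row pairing can be pictured with a small plain-Python sketch. The helper `to_rows` is hypothetical and only illustrates how parallel keyword lists line up into per-row inputs; it is not how LLMEvaluator is implemented internally, and the equal-length check is an assumption about sensible usage.

```python
def to_rows(**inputs: list) -> list[dict]:
    """Turn parallel keyword lists into one dict per row (hypothetical helper)."""
    lengths = {name: len(values) for name, values in inputs.items()}
    if len(set(lengths.values())) > 1:
        # Mismatched lists cannot be paired row by row.
        raise ValueError(f"Input lists differ in length: {lengths}")
    names = list(inputs)
    return [dict(zip(names, row)) for row in zip(*inputs.values())]


rows = to_rows(
    answers=["Paris is the capital of France.", "2 + 2 = 5"],
    ground_truth=["Paris", "4"],
)
for row in rows:
    print(row)
```

Each resulting row carries one value per declared input, so a judge with `answers` and `ground_truth` inputs sees the first answer next to the first reference, the second next to the second, and so on.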
Custom programmatic metrics with PythonEvaluator
PythonEvaluator executes a user-defined evaluate function inside the same restricted sandbox used by the Python node. Each input dict must supply the function's required parameters; the function returns the score:
from dynamiq.evaluations import PythonEvaluator

user_code = """
def evaluate(answer, expected):
    return 1.0 if answer == expected else 0.0
"""

evaluator = PythonEvaluator(code=user_code)
scores = evaluator.run(
    input_data_list=[
        {"answer": "Paris", "expected": "Paris"},
        {"answer": "Madrid", "expected": "Barcelona"},
    ]
)
print(scores)  # [1.0, 0.0]

run_single(input_data={...}) scores one row. The code is compiled with restricted globals, must define a callable named evaluate, and may use default parameter values for optional inputs.
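As an illustration of default parameter values, the sketch below defines an `evaluate` function with an optional `case_sensitive` input (a name invented for this example) and smoke-tests the source locally with plain `exec` before it would be handed to PythonEvaluator. The local check does not reproduce the restricted sandbox, so whether every construct used here is permitted there should be verified separately.

```python
# Source string in the shape PythonEvaluator expects: a callable named `evaluate`.
# `case_sensitive` is a hypothetical optional input with a default value.
user_code = """
def evaluate(answer, expected, case_sensitive=False):
    if not case_sensitive:
        answer, expected = answer.lower(), expected.lower()
    return 1.0 if answer.strip() == expected.strip() else 0.0
"""

# Local smoke test with plain exec (NOT the restricted sandbox).
namespace: dict = {}
exec(user_code, namespace)
evaluate = namespace["evaluate"]

rows = [
    {"answer": "paris ", "expected": "Paris"},                         # default applies
    {"answer": "paris", "expected": "Paris", "case_sensitive": True},  # optional input supplied
]
local_scores = [evaluate(**row) for row in rows]
print(local_scores)  # [1.0, 0.0]
```

Rows that omit the optional key fall back to the default, while rows that include it override the behaviour for that row only.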
Evaluating workflow outputs
A common pattern: run a workflow, pull the answer and retrieved documents out of the result, and score them — straight from the workflow evaluation example:
from dynamiq.evaluations.metrics import ContextRecallEvaluator, FaithfulnessEvaluator
from dynamiq.nodes.llms import OpenAI

question = "How to build an advanced RAG pipeline?"
wf_result = retrieval_wf.run(input_data={"query": question})  # any RAG workflow; see RAG Pipeline

answer = wf_result.output["openai-1"]["output"]["answer"]
documents = wf_result.output["document-retriever-node-1"]["output"]["documents"]
context = " ".join(doc["content"] for doc in documents)

llm = OpenAI(model="gpt-4o-mini")
recall = ContextRecallEvaluator(llm=llm).run(questions=[question], answers=[answer], contexts=[context])
faithfulness = FaithfulnessEvaluator(llm=llm).run(questions=[question], answers=[answer], contexts=[context])

print({
    "context_recall": recall.results[0].score,
    "faithfulness": faithfulness.results[0].score,
})

To run evaluations over datasets with versioned runs and dashboards, use the platform's Evaluations: datasets, metrics, and evaluation runs. The platform metric types map to these same SDK classes.
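For a handful of questions scored locally, the per-question scores collected from the evaluators can be averaged into one number per metric. The sketch below is plain Python with a hypothetical `summarize` helper; the score values are made-up placeholders standing in for `result.score` values, not real evaluation output.

```python
from statistics import mean


def summarize(scores_by_metric: dict[str, list[float]]) -> dict[str, float]:
    """Average each metric's per-question scores, skipping metrics with no scores."""
    return {
        name: round(mean(scores), 3)
        for name, scores in scores_by_metric.items()
        if scores
    }


# Placeholder values for illustration only.
collected = {
    "context_recall": [1.0, 0.5, 0.75],
    "faithfulness": [1.0, 1.0, 0.5],
}
summary = summarize(collected)
print(summary)  # {'context_recall': 0.75, 'faithfulness': 0.833}
```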