Metrics

Define how outputs are scored: LLM-as-a-judge rubrics, predefined RAG evaluators, or your own Python code.

A metric is a reusable scoring function. You create metrics once on the METRICS tab of Evaluations, then attach them to evaluation runs. Metrics are versioned — every edit creates a new version, and runs pin the version they used, so reruns stay reproducible. A metric's type cannot be changed after creation.

The three metric types

Open Evaluations → METRICS and click Add new metric. The dialog asks for a Name and a Metric type:

| Type | How it scores | Best for |
|---|---|---|
| LLM-as-a-judge | An LLM follows your written rubric and returns a score | Subjective qualities: hallucination, relevance, tone, frustration |
| Predefined | A built-in evaluator from the Dynamiq Python library | Standard RAG quality measures with known semantics |
| Code | A Python evaluate(...) function you write | Deterministic checks: exact match, regex, JSON validity |

Figure: The Add new metric dialog with Name, the metric type segmented control, and the LLM-as-a-judge configuration.

LLM-as-a-judge

Configure the judge model and the rubric:

  • LLM Provider, Model, Connection, and Temperature — the model that performs the judging. The Connection dropdown offers your existing Connections, with + New connection inline.
  • Instructions — the rubric. Use the Template menu to start from a built-in rubric: Custom, Hallucination, Factual Accuracy, Completeness, Clarity and Coherence, Relevance, Language Quality, Ethical Compliance, Originality and Creativity, or User Frustration.

Placeholders written as {{question}}, {{answer}}, {{context}}, and so on become the metric's Inputs. They are listed as labels under the editor, and at run time you map dataset fields or workflow outputs onto each one. A good rubric describes the task and the scoring scale, and instructs the model to return strict JSON such as {"score": X}; the built-in templates all follow this pattern.
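To make the placeholder-to-input relationship concrete, here is a minimal sketch that lists the placeholder names found in a rubric. The helper rubric_inputs and its regex are hypothetical illustrations, not part of the Dynamiq SDK, and the platform's own parsing may differ.

```python
import re

# Hypothetical helper for illustration only; not a Dynamiq API.
# Matches {{name}} and tolerates spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def rubric_inputs(instructions: str) -> list[str]:
    """Return placeholder names in first-appearance order, without duplicates."""
    seen = []
    for name in PLACEHOLDER.findall(instructions):
        if name not in seen:
            seen.append(name)
    return seen


rubric = (
    "Score 0-5 how hallucinated the answer is given the context. "
    "Question: {{question}} Context: {{context}} Answer: {{answer}}. "
    'Respond exactly as {"score": X}.'
)
print(rubric_inputs(rubric))  # ['question', 'context', 'answer']
```

The single-brace JSON in the response instruction is not picked up, because only double-brace tokens count as placeholders in this sketch.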

You can also provide few-shot examples (pairs of inputs and outputs) in the metric config via the API to anchor the judge's scoring.

Predefined

Pick a Metric Preset and the judge LLM it should use. The platform ships five presets, each with fixed inputs you map at run time:

| Preset | Inputs | Measures |
|---|---|---|
| AnswerCorrectness | questions, answers, ground_truth_answers | How close answers are to the ground truth |
| ContextPrecision | questions, answers, contexts_list | Whether retrieved contexts that mattered rank high |
| ContextRecall | questions, answers, contexts | Whether the contexts cover the ground truth |
| FactualCorrectness | answers, contexts | Claim-level factual overlap between answer and context |
| Faithfulness | questions, answers, contexts | Whether the answer is grounded in the contexts |

These map to evaluator classes in the Dynamiq Python library (dynamiq.evaluations.metrics.AnswerCorrectnessEvaluator, ContextPrecisionEvaluator, ContextRecallEvaluator, FactualCorrectnessEvaluator, FaithfulnessEvaluator). The library itself contains additional evaluators (BLEU, ROUGE, exact match, string similarity) that you can use from the Python SDK; the five above are the ones exposed as platform presets.

Code

Write a Python function named evaluate in the Source Code editor. Its parameters become the metric's Inputs, and its return value is the score:

def evaluate(answer, expected):
    return 1 if answer == expected else 0

The Template menu offers ready-made examples: Exact Match, Email Presence, Phone Presence, String Presence, Arithmetic Sum, JSON Validity Check, and Check Answer Letter Match. A regex-based template looks like this:

import re

def evaluate(answer):
    # Default email regex pattern
    email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
    return 1 if re.search(email_pattern, answer) else 0
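Another deterministic check follows the same shape. The sketch below scores whether an answer parses as JSON. It is written from scratch for illustration and is not the source of the built-in JSON Validity Check template, which is not shown on this page.

```python
import json


def evaluate(answer):
    # Score 1 when the answer parses as JSON, 0 otherwise.
    # TypeError covers non-string inputs such as None;
    # ValueError covers malformed JSON (json.JSONDecodeError subclasses it).
    try:
        json.loads(answer)
    except (TypeError, ValueError):
        return 0
    return 1
```

As with the other code metrics, the single parameter answer would become the metric's only Input.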

Click Create to save the metric. Opening an existing metric from the list shows a read-only Metric preview.

Manage metrics via the API

Create a metric with POST /v1/metrics. The payload takes name, project_id, a type (llm_as_a_judge, predefined, or custom, where custom corresponds to the Code type in the UI), and a type-specific config. For an LLM-as-a-judge metric:

curl -X POST "https://api.getdynamiq.ai/v1/metrics" \
  -H "Authorization: Bearer $DYNAMIQ_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "hallucination-judge",
    "project_id": "<your-project-id>",
    "type": "llm_as_a_judge",
    "config": {
      "instructions": "Score 0-5 how hallucinated the answer is given the context. Question: {{question}} Context: {{context}} Answer: {{answer}}. Respond exactly as {\"score\": X}.",
      "llm": {
        "type": "dynamiq.nodes.llms.OpenAI",
        "model": "gpt-4o-mini",
        "connection_id": "<your-connection-id>"
      }
    }
  }'

config.examples is optional: a list of {"inputs": {...}, "outputs": {...}} few-shot pairs. llm.temperature and llm.max_tokens are also optional.
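The optional fields can be assembled programmatically before posting. The following is a sketch only: judge_payload is a hypothetical helper, and the keys used inside each example's inputs and outputs (the rubric's placeholder names and a score) are assumptions, since this page specifies only the outer {"inputs": ..., "outputs": ...} shape.

```python
import json


def judge_payload(name, project_id, instructions, connection_id, examples=None):
    """Build the POST /v1/metrics body for an llm_as_a_judge metric."""
    payload = {
        "name": name,
        "project_id": project_id,
        "type": "llm_as_a_judge",
        "config": {
            "instructions": instructions,
            "llm": {
                "type": "dynamiq.nodes.llms.OpenAI",
                "model": "gpt-4o-mini",
                "connection_id": connection_id,
            },
        },
    }
    # config.examples is optional, so only include it when provided.
    if examples:
        payload["config"]["examples"] = examples
    return payload


# Assumed example shape: inputs keyed by placeholder name, outputs carrying a score.
few_shot = [
    {
        "inputs": {
            "question": "What is the capital of France?",
            "context": "Paris is the capital of France.",
            "answer": "Paris.",
        },
        "outputs": {"score": 0},
    }
]

body = json.dumps(
    judge_payload(
        name="hallucination-judge",
        project_id="<your-project-id>",
        instructions=(
            "Score 0-5 how hallucinated the answer is given the context. "
            "Question: {{question}} Context: {{context}} Answer: {{answer}}. "
            'Respond exactly as {"score": X}.'
        ),
        connection_id="<your-connection-id>",
        examples=few_shot,
    ),
    indent=2,
)
print(body)
```

Building the body with json.dumps avoids the manual quote escaping that the inline curl payload needs.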

A predefined metric nests the evaluator class and its judge LLM under config:

curl -X POST "https://api.getdynamiq.ai/v1/metrics" \
  -H "Authorization: Bearer $DYNAMIQ_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "answer-correctness",
    "project_id": "<your-project-id>",
    "type": "predefined",
    "config": {
      "type": "dynamiq.evaluations.metrics.AnswerCorrectnessEvaluator",
      "config": {
        "llm": {
          "type": "dynamiq.nodes.llms.OpenAI",
          "model": "gpt-4o-mini",
          "connection_id": "<your-connection-id>"
        }
      }
    }
  }'

Valid config.type values all share the dynamiq.evaluations.metrics. prefix: AnswerCorrectnessEvaluator, ContextPrecisionEvaluator, ContextRecallEvaluator, FactualCorrectnessEvaluator, and FaithfulnessEvaluator. FactualCorrectnessEvaluator additionally accepts optional mode, beta, atomicity, and coverage fields.

A code metric uses the custom type and passes the function source as a string in config.code:

curl -X POST "https://api.getdynamiq.ai/v1/metrics" \
  -H "Authorization: Bearer $DYNAMIQ_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "exact-match",
    "project_id": "<your-project-id>",
    "type": "custom",
    "config": {
      "code": "def evaluate(answer, expected):\n    return 1 if answer == expected else 0"
    }
  }'

Related endpoints:

  • GET /v1/metrics?project_id=... lists metrics.
  • PUT /v1/metrics/{metric_id} updates a metric, creating a new version; the type must stay the same.
  • DELETE /v1/metrics/{metric_id} removes a metric.
  • GET /v1/metrics/{metric_id}/versions lists versions, newest first.
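These endpoints can also be called from Python. The sketch below only constructs the requests with the standard library and does not send them. The base URL and bearer header come from the curl examples above; build_request and the placeholder IDs are hypothetical, and response shapes are not documented here, so none are assumed.

```python
import json
import os
import urllib.parse
import urllib.request

BASE_URL = "https://api.getdynamiq.ai/v1"


def build_request(method, path, token, query=None, body=None):
    """Hypothetical helper: prepare (but do not send) a metrics API request."""
    url = BASE_URL + path
    if query:
        url += "?" + urllib.parse.urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


token = os.environ.get("DYNAMIQ_ACCESS_KEY", "your-access-key")
metric_id = "your-metric-id"  # placeholder, not a real ID

requests_to_send = [
    build_request("GET", "/metrics", token, query={"project_id": "your-project-id"}),
    build_request("GET", f"/metrics/{metric_id}/versions", token),
    build_request("DELETE", f"/metrics/{metric_id}", token),
]

# Passing a request to urllib.request.urlopen(...) would perform the call.
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Keeping construction separate from sending makes it easy to inspect the method and URL before running a destructive call such as DELETE.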

Test a metric

POST /v1/metrics/test runs one or more metric configurations against sample inputs without creating anything — useful for tuning a rubric before saving it. Each entry carries the metric config, a sample input, and an input_transformer whose selector maps input fields onto the metric's parameters:

curl -X POST "https://api.getdynamiq.ai/v1/metrics/test" \
  -H "Authorization: Bearer $DYNAMIQ_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "<your-project-id>",
    "metrics": [
      {
        "id": "1",
        "metric": {
          "type": "llm_as_a_judge",
          "instructions": "Score 1-5 the factual accuracy of the answer. Question: {{question}} Answer: {{answer}} Ground truth: {{ground_truth}}. Respond exactly as {\"score\": X}.",
          "llm": {
            "type": "dynamiq.nodes.llms.OpenAI",
            "model": "gpt-4o-mini",
            "connection_id": "<your-connection-id>"
          }
        },
        "input_transformer": {
          "selector": {
            "question": "$.question",
            "answer": "$.answer",
            "ground_truth": "$.ground_truth"
          }
        },
        "input": {
          "question": "What is the capital of France?",
          "answer": "Paris is the capital of France.",
          "ground_truth": "Paris is the capital of France."
        }
      }
    ]
  }'

The response's results array carries one object per entry with your id, a status, the computed score, and an error field when scoring failed. The same selector syntax ($.field paths) is what you configure as input mappings when wiring metrics into an evaluation run.
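As a rough mental model of what a selector does, the sketch below resolves simple $.field paths against a sample input and produces the keyword arguments a metric would receive. It is an illustration only: resolve_path and apply_selector are hypothetical names, the platform's actual path handling may support more syntax, and the nested-path support here is an assumption.

```python
def resolve_path(path, data):
    """Resolve a simple '$.a.b' style path against nested dicts."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")
    value = data
    for key in path[2:].split("."):
        value = value[key]
    return value


def apply_selector(selector, data):
    """Map each metric parameter to the value its path points at."""
    return {param: resolve_path(path, data) for param, path in selector.items()}


selector = {
    "question": "$.question",
    "answer": "$.answer",
    "ground_truth": "$.ground_truth",
}
sample = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "ground_truth": "Paris is the capital of France.",
}

params = apply_selector(selector, sample)
print(params["answer"])  # Paris is the capital of France.
```

The selector keys are the metric's parameter names, so a dataset column with a different name can be mapped without renaming it, for example {"answer": "$.model_output"}.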
