Evaluations Overview

Measure the quality of your workflows with metrics, datasets, and evaluation runs — before and after you deploy.

Evaluations let you score the outputs of your workflows against test data, so quality changes are measured instead of guessed. The feature lives in your project under Evaluations and is built from three pieces you compose: Metrics define how to score, Datasets define what to score against, and Evaluation Runs put them together and produce results.

Figure: The Evaluations page with the EVALUATIONS, DATASETS, and METRICS tabs.

The building blocks

Piece	What it is	Where it lives
Metric	A scoring function — an LLM judge with a rubric, a predefined evaluator, or your own Python code	METRICS tab
Dataset	A versioned table of test items (questions, expected answers, contexts, traces…)	DATASETS tab
Evaluation Run	One execution that scores a dataset version with one or more metrics, optionally running each row through a workflow first	EVALUATIONS tab
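To illustrate the "your own Python code" option for a Metric, a custom scorer can be as small as one function. This is only a sketch: the function name, its parameters (`answer`, `expected`), and the shape of the returned dictionary are assumptions made for illustration, not the platform's documented metric signature.

```python
def exact_match_metric(answer: str, expected: str) -> dict:
    """Hypothetical custom metric: 1.0 when the normalized texts match, else 0.0."""

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so formatting differences are ignored.
        return " ".join(text.lower().split())

    matched = normalize(answer) == normalize(expected)
    return {
        "score": 1.0 if matched else 0.0,
        "reason": "match" if matched else "mismatch",
    }
```

An exact-match scorer like this suits short factual answers. Free-form answers usually need an LLM judge or a predefined evaluator instead.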

A typical loop:

1. Build a Metric, for example an LLM-as-a-judge with a hallucination rubric, or the predefined AnswerCorrectness evaluator.
2. Assemble a Dataset of test items and Release a version. Items can be typed in, uploaded as JSON, or captured from production traces with one click.
3. Start an Evaluation Run that maps dataset fields (and workflow outputs) into the metric inputs, then read the per-row scores in the results table.
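The mapping step of an Evaluation Run can be pictured as a lookup from metric input names to fields of the dataset item or the workflow output. The sketch below is a local illustration only: the `dataset.` and `workflow.` path prefixes, the function name, and the field names are assumptions, not the platform's mapping syntax.

```python
from typing import Any, Dict, Mapping, Optional


def build_metric_inputs(
    mapping: Mapping[str, str],
    dataset_item: Mapping[str, Any],
    workflow_output: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Resolve each metric input from 'dataset.<field>' or 'workflow.<field>'."""
    sources = {"dataset": dataset_item, "workflow": workflow_output or {}}
    inputs: Dict[str, Any] = {}
    for metric_input, path in mapping.items():
        # Split "dataset.question" into source "dataset" and field "question".
        source_name, _, field = path.partition(".")
        if source_name not in sources:
            raise ValueError(f"unknown source in mapping: {path!r}")
        if field not in sources[source_name]:
            raise KeyError(f"field {field!r} is missing from {source_name}")
        inputs[metric_input] = sources[source_name][field]
    return inputs


# Example: one dataset row plus the output of running it through a workflow.
row = {"question": "Capital of France?", "expected_answer": "Paris"}
output = {"answer": "Paris"}
metric_inputs = build_metric_inputs(
    {
        "question": "dataset.question",
        "expected": "dataset.expected_answer",
        "answer": "workflow.answer",
    },
    row,
    output,
)
```

Failing loudly on a missing field mirrors what you want from a real run: a row that cannot be mapped should surface as an error rather than be scored against empty inputs.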

Why evaluate

Before deploying — run a candidate workflow version against a released dataset and compare its scores with the version currently in production. Released dataset versions are immutable, so the comparison is apples to apples.
After deploying — production traces can be added to a dataset directly from the trace view, turning real user interactions into regression tests. Datasets whose items carry trace fields can be scored in Dataset only mode without re-running anything.
While iterating on prompts and agents — metrics are versioned, and every evaluation run pins the exact metric and workflow versions it used, so reruns reproduce the same configuration.
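The before-deploying comparison can be sketched as follows, assuming you have exported per-row scores from two runs over the same released dataset version. The input shape (item ID mapped to a numeric score) and the function name are assumptions for illustration. The real results export may be structured differently.

```python
from statistics import mean
from typing import Dict, List, Union


def compare_runs(
    production_scores: Dict[str, float],
    candidate_scores: Dict[str, float],
) -> Dict[str, Union[float, List[str]]]:
    """Compare two runs row by row; both must cover the same dataset items."""
    if production_scores.keys() != candidate_scores.keys():
        raise ValueError("runs must score the same dataset items to be comparable")
    # Items where the candidate scored strictly lower than production.
    regressed = sorted(
        item_id
        for item_id in production_scores
        if candidate_scores[item_id] < production_scores[item_id]
    )
    return {
        "production_mean": mean(production_scores.values()),
        "candidate_mean": mean(candidate_scores.values()),
        "regressed_items": regressed,
    }


summary = compare_runs(
    {"item-a": 1.0, "item-b": 0.5},
    {"item-a": 0.5, "item-b": 1.0},
)
```

Listing regressed items alongside the means matters because an unchanged average can hide rows that got worse, as in the example above.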

Run statuses

An evaluation run moves from pending to running and then ends in one of three terminal statuses: succeeded, failed, or canceled. Results appear per row in the run's results table, and you can download the full result set as JSON once the run has reached a terminal status.
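A client that waits for a run to finish only needs to know which statuses are terminal. The following is a minimal polling sketch. `fetch_status` is a hypothetical callable you would implement yourself (for example, by reading the run through the API), and the polling interval and limit are arbitrary choices.

```python
import time
from typing import Callable

# Statuses after which a batch evaluation run no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_run(
    fetch_status: Callable[[], str],
    poll_seconds: float = 2.0,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run reaches a terminal status, then return that status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"run still not finished after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.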

Online evaluations

The evaluations above are batch: you score a fixed dataset on demand. An online evaluation instead attaches a metric to a deployed App and scores a sampled share of its live traces continuously — measuring quality on real production traffic instead of a curated set.

You attach a saved Metric (pinned to a specific version) to an app and set a sample rate between 0 and 1 — the fraction of incoming traces to score. While the evaluation is enabled, a background consumer samples each new app trace, remaps its fields into the metric's inputs with an optional input transformer, and records a run carrying the trace's score. Each online-evaluation run moves through pending → queued → running → completed or failed.
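To make the sample rate concrete, one way to read "the fraction of incoming traces to score" is as an independent random draw per trace. This is an assumption: how the background consumer actually selects traces is not specified here, and the function below is purely illustrative.

```python
import random


def should_score(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether to score one trace, given a sample rate in [0, 1]."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # rng.random() is in [0.0, 1.0), so a rate of 0 never scores and 1 always scores.
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded only to make the example repeatable
scored = sum(should_score(0.25, rng) for _ in range(10_000))
```

Under this reading, a rate of 0.25 scores roughly a quarter of traces over time, not exactly every fourth trace.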

Online evaluations are managed through the API — there is no dedicated UI surface yet:

POST /v1/apps/{app_id}/evaluations attaches a metric to an app; the same path lists an app's online evaluations.
GET, PUT, and DELETE on /v1/app-evaluations/{evaluation_id} read one, update its sampling and metric version, or remove it.
GET /v1/app-evaluations/{evaluation_id}/runs lists the scored traces, filterable by status.

The metric must belong to the app's project, and the bound metric is immutable once attached — point the evaluation at a newer metric_version_id to score with an updated version.
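As a sketch of pointing an online evaluation at a newer metric version, the snippet below builds (but does not send) a PUT request for the endpoint listed above. Several details are assumptions: the host in `BASE_URL` is a placeholder, `sample_rate` is a guessed field name, the IDs are made up, and authentication is omitted. Only the path and `metric_version_id` come from this page.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real API address


def update_online_evaluation_request(
    evaluation_id: str,
    metric_version_id: str,
    sample_rate: float,
) -> urllib.request.Request:
    """Build a PUT request that updates an online evaluation's version and sampling."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    body = json.dumps(
        {
            "metric_version_id": metric_version_id,
            "sample_rate": sample_rate,  # assumed field name
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/app-evaluations/{evaluation_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = update_online_evaluation_request("eval-123", "metric-version-2", 0.1)
```

Sending it would be a call to `urllib.request.urlopen(request)` once the real host and credentials are filled in. Check the endpoint reference for the actual request body before relying on these field names.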

API surface

Everything in this section is also available on the management API:

POST /v1/metrics, GET /v1/metrics, POST /v1/metrics/test — manage and test metrics.
POST /v1/datasets, POST /v1/datasets/{dataset_id}/versions, POST /v1/dataset-items/from-trace — manage datasets, versions, and items.
POST /v1/evaluations, GET /v1/evaluations/{evaluation_id}/results, POST /v1/evaluations/{evaluation_id}/rerun — start runs and read results.

Each page in this section documents its endpoints next to the UI walkthrough.

Next steps

Metrics

Datasets

Evaluation Runs
