
Error Handling & Retries

Configure per-node timeouts, retries with exponential backoff, and failure propagation with the ErrorHandling model.

Every node carries an error_handling field — an ErrorHandling model that controls how long an execution may run, how many times it is retried, how the retry delay grows, and what a failure does to the rest of the flow. The defaults are conservative: no timeout, no retries, and failures propagate.

The ErrorHandling model

from dynamiq.nodes.node import ErrorHandling
from dynamiq.nodes.types import Behavior

error_handling = ErrorHandling(
    timeout_seconds=30.0,
    max_retries=3,
    retry_interval_seconds=2.0,
    backoff_rate=2.0,
    behavior=Behavior.RAISE,
)

timeout_seconds: float | null
    Max seconds per execution attempt. Default null (no timeout). A timed-out attempt counts as a failure and is retried like any other error.
max_retries: int
    Number of retries after the first attempt. Default 0. The node runs at most max_retries + 1 times in total.
retry_interval_seconds: float
    Base delay between attempts. Default 1.
backoff_rate: float
    Multiplier applied per attempt: delay = retry_interval_seconds * backoff_rate ** attempt, where attempt is the zero-based index of the attempt that just failed. Default 1 (constant delay).
behavior: "raise" | "return"
    What a final failure does downstream. "raise" (default): dependent nodes are skipped and the workflow fails. "return": dependents still run and receive the failed result in their inputs.

With the values above, a flaky call is attempted up to 4 times with delays of 2s, 4s, and 8s between attempts.
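
The delay schedule can be checked with a few lines of plain Python. This is a minimal sketch of the formula above, not SDK code; the helper name retry_delays is made up for illustration, and it assumes the attempt index starts at 0, which is what the 2s, 4s, 8s sequence implies.

```python
def retry_delays(max_retries: int,
                 retry_interval_seconds: float = 1.0,
                 backoff_rate: float = 1.0) -> list:
    """Delays slept between attempts, assuming `attempt` counts from 0."""
    return [
        retry_interval_seconds * backoff_rate ** attempt
        for attempt in range(max_retries)
    ]


# Values from the ErrorHandling example above.
print(retry_delays(3, 2.0, 2.0))  # [2.0, 4.0, 8.0]
# Defaults: backoff_rate=1 keeps the delay constant.
print(retry_delays(3))            # [1.0, 1.0, 1.0]
```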

How the retry loop works

execute_with_retry in dynamiq/nodes/node.py wraps every node execution:

  1. Before each attempt, ensure_client() runs — connection-backed nodes detect a closed client and reconnect; a reconnection failure consumes an attempt and is retried with the same backoff.
  2. The attempt runs execute(). With timeout_seconds set, synchronous runs submit execute() to a thread pool and enforce the timeout on the resulting future; asynchronous runs use asyncio.wait_for.
  3. On any exception (including timeout), the error callback fires (on_node_execute_error, visible in Traces), the loop sleeps retry_interval_seconds * backoff_rate ** attempt, and tries again. Async execution uses non-blocking asyncio.sleep.
  4. After the last attempt, the final error is raised and the node returns a RunnableResult with status="failure" and an error carrying the exception type and message.
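
The synchronous path of these steps can be sketched in standalone Python. This is an illustration of the pattern, not the body of execute_with_retry: the function name run_with_retry, the on_error hook, and the injectable sleep parameter are all assumptions made for the example, and the reconnection step is left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(fn, *, timeout_seconds=None, max_retries=0,
                   retry_interval_seconds=1.0, backoff_rate=1.0,
                   on_error=None, sleep=time.sleep):
    """Call fn() up to max_retries + 1 times (illustrative sketch only)."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            if timeout_seconds is None:
                return fn()
            # Run on a worker thread and bound the wait on the future.
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                return pool.submit(fn).result(timeout=timeout_seconds)
            finally:
                # Do not block on a worker still running after a timeout.
                pool.shutdown(wait=False)
        except Exception as exc:  # a future timeout lands here as well
            last_error = exc
            if on_error is not None:
                on_error(attempt, exc)  # stand-in for the error callback
            if attempt < max_retries:
                sleep(retry_interval_seconds * backoff_rate ** attempt)
    raise last_error


calls = {"n": 0}


def flaky():
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


slept = []  # record delays instead of really sleeping
result = run_with_retry(flaky, max_retries=3, retry_interval_seconds=2.0,
                        backoff_rate=2.0, sleep=slept.append)
print(result, calls["n"], slept)  # ok 3 [2.0, 4.0]
```

Note that a thread-based timeout only stops waiting for the result; the worker thread itself keeps running until the function returns.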

Cancellation is never retried — a canceled run exits the loop immediately with status="canceled". See Running Workflows & Results for the cancellation API.
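
One way to get this behavior in an asyncio retry loop is to catch Exception only: asyncio.CancelledError derives from BaseException (Python 3.8+), so a cancellation passes straight through. The sketch below is an assumption about how such a loop could look, not the SDK's code; run_with_retry_async and make_coro are hypothetical names, and result statuses are not modeled.

```python
import asyncio


async def run_with_retry_async(make_coro, *, timeout_seconds=None,
                               max_retries=0, retry_interval_seconds=1.0,
                               backoff_rate=1.0):
    """Await make_coro() up to max_retries + 1 times (illustrative)."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_coro(), timeout=timeout_seconds)
        except Exception:
            # CancelledError is a BaseException, so it is not caught here
            # and a cancelled run leaves the loop without another attempt.
            if attempt == max_retries:
                raise
            # Non-blocking sleep between attempts.
            await asyncio.sleep(retry_interval_seconds * backoff_rate ** attempt)


async def main():
    attempts = 0

    async def slow():
        nonlocal attempts
        attempts += 1
        await asyncio.sleep(10)

    task = asyncio.create_task(run_with_retry_async(slow, max_retries=3))
    await asyncio.sleep(0.01)  # let the first attempt start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return attempts, task.cancelled()


attempts, cancelled = asyncio.run(main())
print(attempts, cancelled)  # 1 True
```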

If a node has input streaming enabled, its streaming.timeout must be smaller than error_handling.timeout_seconds — the SDK rejects the configuration otherwise, so that the input-wait timeout can fire before the generic execution timeout.
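
The rule can be expressed as a small check. This is a hypothetical helper written for illustration, not the SDK's validator; in particular, skipping the comparison when either timeout is unset is an assumption.

```python
from typing import Optional


def check_timeouts(streaming_timeout: Optional[float],
                   execution_timeout: Optional[float]) -> None:
    """Reject an input-streaming timeout that could not fire first."""
    if streaming_timeout is None or execution_timeout is None:
        return  # assumption: nothing to compare when either is unset
    if streaming_timeout >= execution_timeout:
        raise ValueError(
            f"streaming timeout ({streaming_timeout}s) must be smaller than "
            f"the execution timeout ({execution_timeout}s)"
        )


check_timeouts(10.0, 30.0)       # accepted: input wait fires first
try:
    check_timeouts(30.0, 30.0)   # rejected: not strictly smaller
except ValueError as exc:
    print(exc)
```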

Failure propagation: raise vs return

behavior decides what happens to nodes that depend on a failed node:

  • Behavior.RAISE (default) — dependents are skipped (status="skip"), the skip cascades through the DAG, and the workflow result is failure. The result's error.failed_nodes lists the node(s) that caused it.
  • Behavior.RETURN — dependents execute anyway. The failed dependency's result (status, error) is merged into the dependent's input, so a downstream node can implement a fallback path. The same applies to skipped dependencies.
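
The two modes can be compared with a toy status propagation over a three-node chain. This is a simplified model for intuition, not the SDK's scheduler: the propagate function is hypothetical, and it assumes that the behavior of the failed or skipped dependency decides whether its dependents run.

```python
def propagate(order, deps, failed, behavior):
    """Compute node statuses for a DAG.

    order:    node ids in topological order
    deps:     node id -> list of dependency ids
    failed:   ids whose own execution fails
    behavior: node id -> "raise" | "return" (default "raise")
    """
    status = {}
    for node in order:
        blocking = [
            d for d in deps.get(node, [])
            if status[d] in ("failure", "skip")
            and behavior.get(d, "raise") == "raise"
        ]
        if blocking:
            status[node] = "skip"  # the skip cascades to later nodes
        elif node in failed:
            status[node] = "failure"
        else:
            status[node] = "success"
    return status


order = ["a", "b", "c"]
deps = {"b": ["a"], "c": ["b"]}

print(propagate(order, deps, failed={"a"}, behavior={}))
# {'a': 'failure', 'b': 'skip', 'c': 'skip'}
print(propagate(order, deps, failed={"a"}, behavior={"a": "return"}))
# {'a': 'failure', 'b': 'success', 'c': 'success'}
```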

This is the building block for fallback patterns: give the primary node behavior=Behavior.RETURN, then let a downstream node (for example a Python node or a second LLM) inspect the dependency's status in its input and take over when the primary failed.
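
The downstream half of that pattern might look like the function below. Plain dicts stand in for the SDK's result objects here: the "primary" key and the "status", "output", and "error" fields are illustrative names, not the SDK's input schema, so check the actual input shape before adapting this.

```python
def answer_with_fallback(inputs: dict) -> str:
    """Use the primary node's output, or fall back when it failed."""
    primary = inputs["primary"]  # hypothetical key: the dependency's id
    if primary.get("status") == "success":
        return primary["output"]
    reason = (primary.get("error") or {}).get("message", "unknown error")
    return f"Primary failed ({reason}); using fallback answer."


ok = {"primary": {"status": "success", "output": "Paris"}}
bad = {"primary": {"status": "failure", "error": {"message": "timeout"}}}

print(answer_with_fallback(ok))   # Paris
print(answer_with_fallback(bad))  # Primary failed (timeout); using fallback answer.
```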

Complete example

A workflow with a retried, timeboxed LLM call:

from dynamiq import Workflow
from dynamiq.connections import OpenAI as OpenAIConnection
from dynamiq.flows import Flow
from dynamiq.nodes.llms import OpenAI
from dynamiq.nodes.node import ErrorHandling
from dynamiq.nodes.types import Behavior
from dynamiq.prompts import Message, Prompt

llm = OpenAI(
    id="answerer",
    connection=OpenAIConnection(),
    model="gpt-4o-mini",
    prompt=Prompt(messages=[Message(role="user", content="{{ question }}")]),
    error_handling=ErrorHandling(
        timeout_seconds=30.0,
        max_retries=3,
        retry_interval_seconds=2.0,
        backoff_rate=2.0,
        behavior=Behavior.RAISE,
    ),
)

wf = Workflow(flow=Flow(nodes=[llm]))
result = wf.run(input_data={"question": "What is the capital of France?"})

print(result.status)
if result.error:
    for failed in result.error.failed_nodes:
        print(failed.id, failed.error_message)

The same field works on any node — agents included. The agent error-handling example sets one ErrorHandling on the LLM and a wider one (timeout_seconds=60, max_retries=2) on the Agent itself, since an agent attempt spans the whole reasoning loop.
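
When sizing timeouts for nested nodes, it can help to estimate how long a single node may take in the worst case: every attempt runs into its timeout and every backoff delay is slept. The helper below is a back-of-the-envelope sketch with a made-up name; it ignores reconnection time and scheduling overhead.

```python
def worst_case_seconds(timeout_seconds: float, max_retries: int,
                       retry_interval_seconds: float,
                       backoff_rate: float) -> float:
    """Upper bound on wall time if every attempt times out."""
    attempts = max_retries + 1
    sleeps = sum(
        retry_interval_seconds * backoff_rate ** attempt
        for attempt in range(max_retries)
    )
    return attempts * timeout_seconds + sleeps


# The LLM settings from the complete example: 4 * 30s + (2 + 4 + 8)s.
print(worst_case_seconds(30.0, 3, 2.0, 2.0))  # 134.0
```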

Where retries do not help

Retries re-run the same input, so they only help when the failure is transient. Two complementary mechanisms cover the other cases:

  • Recoverable agent errors — inside an agent loop, tools raise ToolExecutionException to send the error back to the LLM as an observation so it can correct its input, instead of failing the node. See Tools & Function Tools.
  • Crash recovery — for failures you cannot retry inline (process died, hit a rate-limit wall), enable Checkpoints and resume the run from the last completed node.
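
The first mechanism, feeding a tool error back as an observation, can be illustrated without the SDK. Everything in this sketch is a stand-in: RecoverableToolError plays the role of a recoverable tool exception, and scripted_llm replaces a real model with a fixed script that corrects its arguments after seeing the error.

```python
class RecoverableToolError(Exception):
    """Stand-in for a recoverable tool exception; not the SDK class."""


def divide_tool(a: float, b: float) -> float:
    if b == 0:
        raise RecoverableToolError("b must be non-zero")
    return a / b


def agent_loop(propose_args, max_steps: int = 3):
    """Call the tool; on a recoverable error, record it and ask again."""
    observations = []
    for _ in range(max_steps):
        args = propose_args(observations)
        try:
            return divide_tool(**args), observations
        except RecoverableToolError as exc:
            # The error becomes an observation instead of failing the node.
            observations.append(f"Tool error: {exc}")
    raise RuntimeError("agent gave up")


def scripted_llm(observations):
    """First proposal is invalid; the second reacts to the observation."""
    return {"a": 6, "b": 0} if not observations else {"a": 6, "b": 3}


print(agent_loop(scripted_llm))
# (2.0, ['Tool error: b must be non-zero'])
```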
