Error Handling & Retries
Configure per-node timeouts, retries with exponential backoff, and failure propagation with the ErrorHandling model.
Every node carries an error_handling field — an ErrorHandling model that controls how long an execution may run, how many times it is retried, how the retry delay grows, and what a failure does to the rest of the flow. The defaults are conservative: no timeout, no retries, and failures propagate.
The ErrorHandling model
from dynamiq.nodes.node import ErrorHandling
from dynamiq.nodes.types import Behavior
error_handling = ErrorHandling(
    timeout_seconds=30.0,
    max_retries=3,
    retry_interval_seconds=2.0,
    backoff_rate=2.0,
    behavior=Behavior.RAISE,
)

timeout_seconds: float | null
max_retries: int
retry_interval_seconds: float
backoff_rate: float
behavior: "raise" | "return"

With the values above, a flaky call is attempted up to 4 times with delays of 2s, 4s, and 8s between attempts.
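The delay schedule can be checked with a few lines of plain Python. This is a minimal sketch, not SDK code: backoff_delays is a hypothetical helper, and it assumes the delay before retry number attempt is retry_interval_seconds * backoff_rate ** attempt with attempt counted from 0, which matches the 2s, 4s, 8s sequence quoted here.

```python
def backoff_delays(max_retries: int, retry_interval_seconds: float, backoff_rate: float) -> list[float]:
    """Delay before each retry, assuming interval * rate ** attempt with attempt starting at 0."""
    return [retry_interval_seconds * backoff_rate ** attempt for attempt in range(max_retries)]


# Hypothetical helper, called with the values from the ErrorHandling example.
delays = backoff_delays(max_retries=3, retry_interval_seconds=2.0, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0 seconds spent sleeping if every attempt fails
```

If the timeout applies to each attempt separately (an assumption), four attempts capped at 30 seconds each plus 14 seconds of sleep put the worst case at roughly 134 seconds, which is worth keeping in mind when choosing an outer deadline.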
How the retry loop works
execute_with_retry in dynamiq/nodes/node.py wraps every node execution:
- Before each attempt, ensure_client() runs: connection-backed nodes detect a closed client and reconnect; a reconnection failure consumes an attempt and is retried with the same backoff.
- The attempt runs execute(). With timeout_seconds set, sync runs execute on a thread pool and enforce the timeout on the future; async runs use asyncio.wait_for.
- On any exception (including timeout), the error callback fires (on_node_execute_error, visible in Traces), the loop sleeps retry_interval_seconds * backoff_rate ** attempt, and tries again. Async execution uses non-blocking asyncio.sleep.
- After the last attempt, the final error is raised and the node returns a RunnableResult with status="failure" and an error carrying the exception type and message.
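The synchronous path of this loop can be sketched in standalone Python. This is an illustration under stated assumptions, not the SDK's execute_with_retry: run_with_retry, on_error, and the injectable sleep parameter are hypothetical names, the default interval and rate are arbitrary, the sketch skips the sleep after the final attempt, and it leaves out ensure_client() and the RunnableResult wrapping.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(fn, *, timeout_seconds=None, max_retries=0,
                   retry_interval_seconds=1.0, backoff_rate=1.0,
                   on_error=None, sleep=time.sleep):
    """Call fn() up to max_retries + 1 times with exponential backoff (hypothetical helper)."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            if timeout_seconds is None:
                return fn()
            # Run on a worker thread so waiting on the future can time out.
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                return pool.submit(fn).result(timeout=timeout_seconds)
            finally:
                # Do not block on a worker that is still running after a timeout.
                pool.shutdown(wait=False)
        except Exception as exc:  # also catches the future's timeout error
            last_error = exc
            if on_error is not None:
                on_error(attempt, exc)  # stand-in for an error callback
            if attempt < max_retries:
                sleep(retry_interval_seconds * backoff_rate ** attempt)
    raise last_error


calls = {"n": 0}

def flaky():
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


slept = []
result = run_with_retry(flaky, timeout_seconds=5.0, max_retries=3,
                        retry_interval_seconds=2.0, backoff_rate=2.0,
                        sleep=slept.append)  # record delays instead of sleeping
print(result, calls["n"], slept)  # ok 3 [2.0, 4.0]
```

Note that a thread-based timeout only stops waiting for the result; the worker thread itself keeps running until fn returns, so long-running calls should still have their own client-side timeouts.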
Cancellation is never retried — a canceled run exits the loop immediately with status="canceled". See Running Workflows & Results for the cancellation API.
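One way an async retry loop can end up with this property is sketched below. It uses only standard asyncio and is not the SDK's cancellation API: run_with_retry_async is a hypothetical helper, and the sketch relies on asyncio.CancelledError deriving from BaseException (Python 3.8+), so an except Exception clause retries ordinary errors and timeouts but lets cancellation through.

```python
import asyncio


async def run_with_retry_async(make_coro, *, timeout_seconds=None, max_retries=0,
                               retry_interval_seconds=1.0, backoff_rate=1.0):
    """Await make_coro() up to max_retries + 1 times (hypothetical helper)."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(make_coro(), timeout=timeout_seconds)
        except Exception:
            # CancelledError is a BaseException, so it is not caught here:
            # cancellation leaves the loop at once instead of being retried.
            if attempt == max_retries:
                raise
            await asyncio.sleep(retry_interval_seconds * backoff_rate ** attempt)


async def main():
    attempts = 0

    async def slow():
        nonlocal attempts
        attempts += 1
        await asyncio.sleep(10)

    task = asyncio.create_task(
        run_with_retry_async(slow, max_retries=3, retry_interval_seconds=0.01)
    )
    await asyncio.sleep(0.05)  # let the first attempt start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "canceled", attempts


outcome = asyncio.run(main())
print(outcome)  # ('canceled', 1)
```

Only one attempt runs even though three retries were allowed, because the cancellation is never treated as a retryable failure.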
If a node has input streaming enabled, its streaming.timeout must be smaller than error_handling.timeout_seconds — the SDK rejects the configuration otherwise, so that the input-wait timeout can fire before the generic execution timeout.
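The constraint amounts to a simple comparison. The sketch below is an assumption-laden illustration rather than the SDK's validator: validate_timeouts is a hypothetical function, "smaller" is read as strictly smaller, and a missing timeout on either side is assumed to be acceptable.

```python
from typing import Optional


def validate_timeouts(streaming_timeout: Optional[float],
                      execution_timeout: Optional[float]) -> None:
    """Raise ValueError unless the input-wait timeout is strictly below the execution timeout."""
    if streaming_timeout is None or execution_timeout is None:
        return  # assumption: nothing to compare when either timeout is unset
    if streaming_timeout >= execution_timeout:
        raise ValueError(
            f"streaming timeout ({streaming_timeout}s) must be smaller than "
            f"the execution timeout ({execution_timeout}s)"
        )


validate_timeouts(streaming_timeout=10.0, execution_timeout=30.0)  # accepted

try:
    validate_timeouts(streaming_timeout=45.0, execution_timeout=30.0)
except ValueError as exc:
    rejected = str(exc)
    print(rejected)
```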
Failure propagation: raise vs return
behavior decides what happens to nodes that depend on a failed node:
- Behavior.RAISE (default): dependents are skipped (status="skip"), the skip cascades through the DAG, and the workflow result is failure. The result's error.failed_nodes lists the node(s) that caused it.
- Behavior.RETURN: dependents execute anyway. The failed dependency's result (status, error) is merged into the dependent's input, so a downstream node can implement a fallback path. The same applies to skipped dependencies.
This is the building block for fallback patterns: give the primary node behavior=Behavior.RETURN, then let a downstream node (for example a Python node or a second LLM) inspect the dependency's status in its input and take over when the primary failed.
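The downstream check might look like the following plain function. It is a sketch under assumptions: the input is modeled as a dict keyed by the dependency's node id with "status", "output", and "error" entries, and both that shape and the name answer_with_fallback are hypothetical, so inspect the actual input your node receives before relying on specific keys.

```python
# Status values named in this page for a dependency that did not produce a result.
FAILED_STATUSES = {"failure", "skip"}


def answer_with_fallback(inputs: dict, primary_id: str = "primary") -> dict:
    """Use the primary node's output, or a fallback when it failed or was skipped."""
    dep = inputs.get(primary_id, {})
    if dep.get("status") in FAILED_STATUSES:
        error = dep.get("error") or {}
        return {
            "source": "fallback",
            "reason": error.get("message", "unknown"),
            "content": "The primary model is unavailable; please try again later.",
        }
    return {"source": "primary", "content": dep.get("output")}


# Assumed input shapes, for illustration only.
failed_input = {"primary": {"status": "failure",
                            "error": {"type": "TimeoutError", "message": "timed out"}}}
ok_input = {"primary": {"status": "success", "output": "Paris"}}

print(answer_with_fallback(failed_input)["source"])  # fallback
print(answer_with_fallback(ok_input)["content"])     # Paris
```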
Complete example
A workflow with a retried, timeboxed LLM call:
from dynamiq import Workflow
from dynamiq.connections import OpenAI as OpenAIConnection
from dynamiq.flows import Flow
from dynamiq.nodes.llms import OpenAI
from dynamiq.nodes.node import ErrorHandling
from dynamiq.nodes.types import Behavior
from dynamiq.prompts import Message, Prompt
llm = OpenAI(
    id="answerer",
    connection=OpenAIConnection(),
    model="gpt-4o-mini",
    prompt=Prompt(messages=[Message(role="user", content="{{ question }}")]),
    error_handling=ErrorHandling(
        timeout_seconds=30.0,
        max_retries=3,
        retry_interval_seconds=2.0,
        backoff_rate=2.0,
        behavior=Behavior.RAISE,
    ),
)
wf = Workflow(flow=Flow(nodes=[llm]))
result = wf.run(input_data={"question": "What is the capital of France?"})
print(result.status)
if result.error:
    for failed in result.error.failed_nodes:
        print(failed.id, failed.error_message)

The same field works on any node, agents included. The agent error-handling example sets one ErrorHandling on the LLM and a wider one (timeout_seconds=60, max_retries=2) on the Agent itself, since an agent attempt spans the whole reasoning loop.
Where retries do not help
Retries re-run the same input. Two complementary mechanisms cover the rest:
- Recoverable agent errors: inside an agent loop, tools raise ToolExecutionException to send the error back to the LLM as an observation so it can correct its input, instead of failing the node. See Tools & Function Tools.
- Crash recovery: for failures you cannot retry inline (process died, hit a rate-limit wall), enable Checkpoints and resume the run from the last completed node.
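The difference between retrying and feeding an error back as an observation can be shown with a toy loop. Everything here is hypothetical and SDK-free: RecoverableToolError is a local stand-in for a recoverable tool exception, toy_agent is a drastically simplified agent loop, and propose plays the role of the LLM choosing tool arguments.

```python
class RecoverableToolError(Exception):
    """Local stand-in for a recoverable tool error; not the SDK's class."""


def divide_tool(args: dict) -> float:
    if args["b"] == 0:
        raise RecoverableToolError("b must be non-zero")
    return args["a"] / args["b"]


def toy_agent(propose, tool, max_loops: int = 3):
    """Each loop asks propose() for tool arguments; tool errors become observations."""
    observations = []
    for _ in range(max_loops):
        args = propose(observations)
        try:
            return tool(args), observations
        except RecoverableToolError as exc:
            # A retry would re-run the same arguments; here the error is
            # reported back so the next proposal can differ.
            observations.append(f"Tool error: {exc}")
    raise RuntimeError("agent exhausted its loops")


def propose(observations):
    # Stand-in for the LLM: corrects its input after seeing an error observation.
    return {"a": 6, "b": 0} if not observations else {"a": 6, "b": 3}


value, seen = toy_agent(propose, divide_tool)
print(value, seen)  # 2.0 ['Tool error: b must be non-zero']
```

A plain retry of the first call would fail every time, since the input never changes; the observation is what lets the second proposal succeed.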
Next steps
Custom Nodes
Subclass Node to build your own workflow components — input schemas, the execute contract, the execution lifecycle, and connection handling with ConnectionNode.
Caching
Cache node outputs in Redis so repeated runs with identical inputs skip execution — per-node opt-in plus a per-run cache config.