Document Processing

Convert files into documents and split them into chunks — the full converter and splitter catalog with configuration.

Document processing is the first half of indexing: converters (dynamiq.nodes.converters) turn raw files into Document objects, and splitters (dynamiq.nodes.splitters) cut those documents into retrieval-sized chunks. Both are ordinary workflow nodes, so they compose with embedders and writers as shown in RAG Pipeline.

Converters

Converter | Input | Notes
--- | --- | ---
PyPDFConverter | PDF | Local parsing via PyPDF
LLMPDFConverter | PDF | Vision-LLM extraction for scanned or layout-heavy PDFs
LLMImageConverter | Images | Vision-LLM text extraction
DOCXFileConverter | Word documents |
PPTXFileConverter | PowerPoint decks |
HTMLConverter | HTML files |
TextFileConverter | Plain text |
CSVConverter | CSV files | One document per row; you choose the content column
UnstructuredFileConverter | Many formats | Uses the Unstructured API (requires a connection)
MultiFileTypeConverter | Mixed batches | Routes each file to a converter by type

Most converters accept either file_paths (paths on disk) or files (bytes / BytesIO objects) at run time, plus optional metadata that is attached to the produced documents:

from io import BytesIO
from pathlib import Path

from dynamiq.nodes.converters import PyPDFConverter

converter = PyPDFConverter(document_creation_mode="one-doc-per-page")

# Load the PDF into memory; read_bytes() opens and closes the file for us.
pdf_file = BytesIO(Path("example.pdf").read_bytes())

result = converter.run(
    input_data={
        "files": [pdf_file],
        "metadata": [{"filename": "example.pdf"}],  # attached to the produced documents
    }
)
documents = result.output["documents"]

document_creation_mode controls granularity: "one-doc-per-file" (default) or "one-doc-per-page".

CSV conversion is column-driven instead:

from dynamiq.nodes.converters import CSVConverter

converter = CSVConverter(
    content_column="description",            # becomes Document.content
    metadata_columns=["sku", "category"],    # copied into Document.metadata
)
result = converter.run(input_data={"file_paths": ["products.csv"]})
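
To make the column-driven idea concrete, here is a minimal standard-library sketch of the same mapping: one record per row, one column as content, selected columns as metadata. It is not Dynamiq's implementation; the helper name rows_to_documents and the plain-dict record shape are assumptions made for illustration.

```python
import csv
from io import StringIO


def rows_to_documents(csv_text, content_column, metadata_columns=()):
    """Turn each CSV row into a {"content", "metadata"} record (illustrative only)."""
    reader = csv.DictReader(StringIO(csv_text))
    documents = []
    for row in reader:
        documents.append(
            {
                # The chosen column becomes the document text.
                "content": row[content_column],
                # The listed columns are copied into the metadata.
                "metadata": {name: row[name] for name in metadata_columns},
            }
        )
    return documents


sample = "sku,category,description\nA1,tools,Steel hammer\nB2,garden,Clay pot\n"
docs = rows_to_documents(sample, "description", ["sku", "category"])
for doc in docs:
    print(doc)
```

Columns that are neither the content column nor listed as metadata are simply dropped in this sketch.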

Splitters

All splitters take documents in and return documents out, so they drop into the same pipeline position.

DocumentSplitter (unit-based)

The general-purpose splitter cuts by a text unit:

from dynamiq.nodes.splitters.document import DocumentSplitter

splitter = DocumentSplitter(
    split_by="sentence",   # "word", "sentence", "page", "passage", "title", "character"
    split_length=10,       # units per chunk (default 10)
    split_overlap=1,       # units shared by consecutive chunks (default 0)
)

The default is split_by="passage" (paragraphs separated by blank lines). Each chunk keeps the source document's metadata plus a source_id pointing back to the original. This is the same splitter the platform's Knowledge Bases use — see Chunking & Embedding.
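
How split_length and split_overlap interact is easiest to see on a toy input. The sketch below groups pre-split units into overlapping windows in plain Python. It assumes a simple sliding-window scheme and is not taken from the DocumentSplitter source; chunk_units is a hypothetical helper.

```python
def chunk_units(units, split_length=10, split_overlap=0):
    """Group units into windows of split_length that share split_overlap units."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    step = split_length - split_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # this window already reached the end of the input
    return chunks


sentences = [f"S{i}." for i in range(1, 8)]  # seven toy "sentences"
chunks = chunk_units(sentences, split_length=3, split_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With seven units, a length of 3 and an overlap of 1, each window starts two units after the previous one, so the last unit of one chunk is repeated as the first unit of the next.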

Structure-aware and advanced splitters

Splitter | Strategy
--- | ---
TokenSplitter | Chunks by token count (chunk_size default 512, chunk_overlap default 50, tiktoken cl100k_base encoding); best match for embedding-model limits
RecursiveCharacterSplitter | Recursively tries a separator hierarchy (paragraph → sentence → word); optional language presets
MarkdownHeaderSplitter | Splits on Markdown headers and stores the header path in chunk metadata
HTMLHeaderSplitter / HTMLSectionSplitter | Same idea for HTML (h1 to h6 tags / sections)
RecursiveJsonSplitter | Splits JSON documents into smaller JSON chunks under max_chunk_size (default 2000)
CodeSplitter | Language-aware source-code splitting (language default Python)
SemanticSplitter | Breaks where embedding similarity between sentence groups drops; requires a TextEmbedder
ContextualSplitter | Anthropic-style contextual retrieval: wraps an inner splitter and uses an LLM to prepend document-level context to each chunk
AutoSplitter | Routes each document to the best splitter based on metadata and content sniffing, with a configurable fallback strategy

Two examples:

from dynamiq.nodes.splitters import TokenSplitter

token_splitter = TokenSplitter(chunk_size=512, chunk_overlap=50)

from dynamiq.connections import OpenAI as OpenAIConnection
from dynamiq.nodes.llms import OpenAI
from dynamiq.nodes.splitters import ContextualSplitter, TokenSplitter

contextual = ContextualSplitter(
    inner_splitter=TokenSplitter(chunk_size=512, chunk_overlap=50),
    llm=OpenAI(connection=OpenAIConnection(), model="gpt-4o-mini"),
)
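
The "header path in chunk metadata" idea from the table can be illustrated without the library. The following is a rough sketch of header-based splitting in plain Python, not MarkdownHeaderSplitter itself: the function name, the header_path metadata key, and the dict record shape are all assumptions, and it ignores edge cases such as "#" lines inside fenced code.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text):
    """Split Markdown on headers and record each chunk's header path."""
    chunks = []
    path = []  # stack of (level, title) for the headers currently in scope
    body = []  # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(
                {"content": text, "metadata": {"header_path": [t for _, t in path]}}
            )
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # the collected body belongs to the previous header path
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall run().\n# FAQ\nNone yet.\n"
sections = split_by_headers(sample)
for section in sections:
    print(section["metadata"]["header_path"], "->", section["content"])
```

Because every chunk carries its full path (for example Guide, then Install), a retriever can filter on a top-level section without parsing the chunk text.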

Choosing a splitter

  • Mixed corpus, minimal tuning: DocumentSplitter with sentence or passage splitting, or AutoSplitter.
  • Hard token budgets (embedding model limits, cost control): TokenSplitter.
  • Markdown/HTML documentation where headings matter: the header splitters. The header path stored in chunk metadata works well as a retrieval filter.
  • Maximum retrieval quality, when you can afford LLM calls at indexing time: ContextualSplitter.
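
When a token budget or per-chunk LLM cost drives the choice, a back-of-the-envelope chunk count helps. The sketch below assumes a fixed-stride sliding window (each chunk starts chunk_size minus chunk_overlap tokens after the previous one); whether TokenSplitter behaves exactly this way is an assumption, and estimate_chunk_count is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size=512, chunk_overlap=50):
    """Estimate how many chunks a fixed-stride sliding window would produce."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1  # everything fits in a single chunk

    stride = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


count = estimate_chunk_count(10_000)  # defaults: 512-token chunks, 50-token overlap
print(count)
```

Multiplying the result by the per-chunk cost of an embedding or LLM call gives a rough indexing budget for a document of that size.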

Putting it together

from io import BytesIO
from pathlib import Path

from dynamiq import Workflow
from dynamiq.nodes.converters import PyPDFConverter
from dynamiq.nodes.splitters import TokenSplitter

wf = Workflow()

converter = PyPDFConverter(document_creation_mode="one-doc-per-page")
# The splitter reads the converter's documents output and runs after it.
splitter = (
    TokenSplitter(chunk_size=512, chunk_overlap=50)
    .inputs(documents=converter.outputs.documents)
    .depends_on(converter)
)
wf.flow.add_nodes(converter, splitter)

# Load the PDF into memory; read_bytes() opens and closes the file for us.
pdf_file = BytesIO(Path("example.pdf").read_bytes())

result = wf.run(
    input_data={
        "files": [pdf_file],
        "metadata": [{"filename": "example.pdf"}],
    }
)
# Workflow results are keyed by node id.
chunks = result.output[splitter.id]["output"]["documents"]

From here the chunks go to an embedder and a writer — continue in Embedders & Vector Stores.