Document Processing
Convert files into documents and split them into chunks — the full converter and splitter catalog with configuration.
Document processing is the first half of indexing: converters (dynamiq.nodes.converters) turn raw files into Document objects, and splitters (dynamiq.nodes.splitters) cut those documents into retrieval-sized chunks. Both are ordinary workflow nodes, so they compose with embedders and writers as shown in RAG Pipeline.
Converters
| Converter | Input | Notes |
|---|---|---|
| PyPDFConverter | PDFs | Local parsing via PyPDF |
| LLMPDFConverter | PDFs | Vision-LLM extraction for scanned or layout-heavy PDFs |
| LLMImageConverter | Images | Vision-LLM text extraction |
| DOCXFileConverter | Word documents | |
| PPTXFileConverter | PowerPoint decks | |
| HTMLConverter | HTML files | |
| TextFileConverter | Plain text | |
| CSVConverter | CSV files | One document per row; you choose the content column |
| UnstructuredFileConverter | Many formats | Uses the Unstructured API (requires a connection) |
| MultiFileTypeConverter | Mixed batches | Routes each file to a converter by type |
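The routing idea behind MultiFileTypeConverter can be pictured as a lookup from file type to converter. The sketch below is plain Python and not the dynamiq implementation: the `ROUTES` table, the `route_files` helper, and the choice of UnstructuredFileConverter as the fallback are all hypothetical, and the real node may detect types differently (for example by content rather than by extension).

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension -> converter-name table (illustration only).
ROUTES = {
    ".pdf": "PyPDFConverter",
    ".docx": "DOCXFileConverter",
    ".pptx": "PPTXFileConverter",
    ".html": "HTMLConverter",
    ".txt": "TextFileConverter",
    ".csv": "CSVConverter",
}


def route_files(file_paths, fallback="UnstructuredFileConverter"):
    """Group paths by the converter name chosen from the file extension."""
    batches = defaultdict(list)
    for path in file_paths:
        suffix = Path(path).suffix.lower()  # case-insensitive match
        batches[ROUTES.get(suffix, fallback)].append(path)
    return dict(batches)


batches = route_files(["report.PDF", "notes.txt", "deck.pptx", "scan.tiff"])
```

Each batch could then be handed to the matching converter's `run` call, so one mixed upload produces a single list of documents.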
Most converters accept either file_paths (paths on disk) or files (bytes / BytesIO objects) at run time, plus optional metadata that is attached to the produced documents:
from io import BytesIO
from dynamiq.nodes.converters import PyPDFConverter
converter = PyPDFConverter(document_creation_mode="one-doc-per-page")
result = converter.run(
    input_data={
        "files": [BytesIO(open("example.pdf", "rb").read())],
        "metadata": [{"filename": "example.pdf"}],
    }
)
documents = result.output["documents"]
document_creation_mode controls granularity: "one-doc-per-file" (default) or "one-doc-per-page".
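To make the two granularities concrete, here is a minimal plain-Python sketch of what each mode implies for the output. It is not the converter's actual code: `build_documents` is a hypothetical helper, documents are modeled as plain dicts, and the `page` metadata key is an assumption made for illustration.

```python
def build_documents(pages, metadata, mode="one-doc-per-file"):
    """Turn extracted page texts into document dicts at the chosen granularity."""
    if mode == "one-doc-per-file":
        # All pages are joined into a single document.
        return [{"content": "\n".join(pages), "metadata": dict(metadata)}]
    if mode == "one-doc-per-page":
        # One document per page; "page" is a hypothetical metadata key.
        return [
            {"content": text, "metadata": {**metadata, "page": number}}
            for number, text in enumerate(pages, start=1)
        ]
    raise ValueError(f"unknown mode: {mode}")


pages = ["Intro text", "Details text"]
per_file = build_documents(pages, {"filename": "example.pdf"})
per_page = build_documents(pages, {"filename": "example.pdf"}, mode="one-doc-per-page")
```

Per-page output gives a downstream splitter smaller inputs, while per-file output lets chunks span page boundaries.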
CSV conversion is column-driven instead:
from dynamiq.nodes.converters import CSVConverter
converter = CSVConverter(
    content_column="description",  # becomes Document.content
    metadata_columns=["sku", "category"],  # copied into Document.metadata
)
result = converter.run(input_data={"file_paths": ["products.csv"]})
Splitters
All splitters take documents in and return documents out, so they drop into the same pipeline position.
DocumentSplitter (unit-based)
The general-purpose splitter cuts by a text unit:
from dynamiq.nodes.splitters.document import DocumentSplitter
splitter = DocumentSplitter(
    split_by="sentence",  # "word", "sentence", "page", "passage", "title", "character"
    split_length=10,  # units per chunk (default 10)
    split_overlap=1,  # units shared by consecutive chunks (default 0)
)
The default is split_by="passage" (paragraphs separated by blank lines). Each chunk keeps the source document's metadata plus a source_id pointing back to the original. This is the same splitter the platform's Knowledge Bases use; see Chunking & Embedding.
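How split_length and split_overlap interact is easiest to see on a small input. The following is a plain-Python sketch of a sliding window over already-separated units; `split_units` is a hypothetical helper, and the real DocumentSplitter may treat edge cases such as the final short chunk differently.

```python
def split_units(units, split_length=10, split_overlap=0):
    """Group units into chunks of split_length that share split_overlap units."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    step = split_length - split_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the window already reached the end of the input
    return chunks


sentences = [f"S{i}." for i in range(1, 8)]  # seven "sentences"
chunks = split_units(sentences, split_length=3, split_overlap=1)
```

With seven sentences, a length of 3, and an overlap of 1, the window advances two sentences at a time, so each chunk repeats the last sentence of the previous one.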
Structure-aware and advanced splitters
| Splitter | Strategy |
|---|---|
| TokenSplitter | Chunks by token count (chunk_size default 512, chunk_overlap default 50, tiktoken cl100k_base encoding); best match for embedding-model limits |
| RecursiveCharacterSplitter | Recursively tries a separator hierarchy (paragraph → sentence → word); optional language presets |
| MarkdownHeaderSplitter | Splits on Markdown headers and stores the header path in chunk metadata |
| HTMLHeaderSplitter / HTMLSectionSplitter | Same idea for HTML (h1–h6 tags / sections) |
| RecursiveJsonSplitter | Splits JSON documents into smaller JSON chunks under max_chunk_size (default 2000) |
| CodeSplitter | Language-aware source-code splitting (language default Python) |
| SemanticSplitter | Breaks where embedding similarity between sentence groups drops; requires a TextEmbedder |
| ContextualSplitter | Anthropic-style contextual retrieval: wraps an inner splitter and uses an LLM to prepend document-level context to each chunk |
| AutoSplitter | Routes each document to the best splitter based on metadata and content sniffing, with a configurable fallback strategy |
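As an illustration of the header-path idea used by the Markdown and HTML header splitters, the sketch below splits Markdown at ATX headers and records the enclosing headers for each chunk. It is a simplified stand-in, not MarkdownHeaderSplitter itself: `split_by_headers` and the `headers` metadata key are hypothetical, and the sketch ignores `#` lines inside fenced code.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text):
    """Split Markdown at headers, recording the header path of each chunk."""
    chunks = []
    path = []    # (level, title) pairs for the current nesting
    buffer = []  # body lines collected under the current header

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            headers = [title for _, title in path]
            chunks.append({"content": text, "metadata": {"headers": headers}})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous header
            level = len(match.group(1))
            # Leave any section at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro.\n## Install\nRun pip.\n## Usage\nCall run().\n# FAQ\nNone yet."
chunks = split_by_headers(doc)
```

Because every chunk carries its header path, a retriever can restrict a query to, say, chunks whose path contains "Install".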
Two examples:
# Example 1: token-based chunking
from dynamiq.nodes.splitters import TokenSplitter
token_splitter = TokenSplitter(chunk_size=512, chunk_overlap=50)
# Example 2: contextual retrieval wrapped around a token splitter
from dynamiq.connections import OpenAI as OpenAIConnection
from dynamiq.nodes.llms import OpenAI
from dynamiq.nodes.splitters import ContextualSplitter, TokenSplitter
contextual = ContextualSplitter(
    inner_splitter=TokenSplitter(chunk_size=512, chunk_overlap=50),
    llm=OpenAI(connection=OpenAIConnection(), model="gpt-4o-mini"),
)
Choosing a splitter
- Mixed corpus, minimal tuning: DocumentSplitter with sentence or passage splitting, or AutoSplitter.
- Hard token budgets (embedding model limits, cost control): TokenSplitter.
- Markdown/HTML documentation where headings matter: the header splitters. The header path in metadata makes an excellent retrieval filter.
- Maximum retrieval quality, when you can pay for LLM calls at indexing time: ContextualSplitter.
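For the token-budget case, a rough chunk count helps when estimating embedding cost up front. The sketch below assumes a simple sliding window in which each chunk starts chunk_size minus chunk_overlap tokens after the previous one; `estimate_chunks` is a hypothetical helper, and the real TokenSplitter may produce a slightly different count.

```python
import math


def estimate_chunks(total_tokens, chunk_size=512, chunk_overlap=50):
    """Estimate how many chunks a sliding token window produces."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # tokens advanced per chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


n_chunks = estimate_chunks(10_000)  # 22 under the sliding-window assumption
```

Under the same assumption, the overlap means slightly more tokens are embedded than the document contains, since every chunk after the first repeats chunk_overlap tokens.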
Putting it together
from io import BytesIO
from dynamiq import Workflow
from dynamiq.nodes.converters import PyPDFConverter
from dynamiq.nodes.splitters import TokenSplitter
wf = Workflow()
converter = PyPDFConverter(document_creation_mode="one-doc-per-page")
splitter = (
    TokenSplitter(chunk_size=512, chunk_overlap=50)
    .inputs(documents=converter.outputs.documents)
    .depends_on(converter)
)
wf.flow.add_nodes(converter, splitter)
result = wf.run(
    input_data={
        "files": [BytesIO(open("example.pdf", "rb").read())],
        "metadata": [{"filename": "example.pdf"}],
    }
)
chunks = result.output[splitter.id]["output"]["documents"]
From here the chunks go to an embedder and a writer. Continue in Embedders & Vector Stores.
Next steps
RAG Pipeline
Build both halves of RAG in the SDK — an indexing flow that converts, splits, embeds, and stores documents, and a retrieval flow that answers questions over them.
Embedders & Vector Stores
Eight embedding providers and eight vector stores — the provider/store matrix with writer configuration for each.