
Document Splitting


Last updated 6 months ago

Why Document Splitting is Important

Document splitting, or chunking, is a vital step in the indexing workflow. It breaks large documents into smaller, manageable pieces so that retrieval can target the relevant sections of a document instead of the whole text, which improves both efficiency and accuracy. Each chunk keeps metadata about its original document, so the retrieved text stays tied to its source and remains meaningful and coherent.

Document Splitter Node

The document splitter node is designed to handle various splitting strategies, providing flexibility in how documents are divided. It receives documents as input and outputs the split documents, while preserving metadata about the original document.

Key Features

Split By Options

  • Character: Splits the document based on a specified number of characters.

  • Word: Divides the document by a set number of words.

  • Sentence: Splits the document into individual sentences.

  • Page: Breaks the document into pages, useful for paginated content.

  • Passage: Divides the document into logical passages or sections.

  • Title: Splits based on titles or headings, ideal for structured documents.
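To make the split by options more concrete, here is a minimal Python sketch of how a text could be broken into units for several of the modes listed above. It is an illustration only, not the Dynamiq implementation: the function name `split_units` and the boundary rules (form feed for pages, blank lines for passages, a naive punctuation rule for sentences) are assumptions. Title-based splitting depends on how headings are marked in the source and is left out.

```python
import re


def split_units(text: str, split_by: str) -> list[str]:
    """Break text into units for a given split mode (hypothetical helper)."""
    if split_by == "character":
        return list(text)
    if split_by == "word":
        # Any run of whitespace separates two words.
        return text.split()
    if split_by == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if split_by == "page":
        # Assumption: pages are separated by the form feed character.
        return [p for p in text.split("\f") if p.strip()]
    if split_by == "passage":
        # Assumption: passages are separated by one or more blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unsupported split_by value: {split_by!r}")


sample = "Intro sentence. Second sentence!\n\nA new passage."
print(split_units(sample, "sentence"))
print(split_units(sample, "passage"))
```

A real splitter would likely use a more robust sentence detector, since abbreviations such as "e.g." break the simple punctuation rule used here.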

Split Length

Defines the size of each chunk. For example, if splitting by characters, you can specify the number of characters per chunk.
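As a rough illustration of split length when splitting by characters, the sketch below cuts a string into chunks of a fixed number of characters. The function name `chunk_by_characters` is hypothetical and the code only mirrors the idea described above, not the node's internals.

```python
def chunk_by_characters(text: str, split_length: int) -> list[str]:
    """Cut text into consecutive chunks of at most split_length characters."""
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    return [text[i:i + split_length] for i in range(0, len(text), split_length)]


print(chunk_by_characters("abcdefghij", 4))  # ['abcd', 'efgh', 'ij']
```

The last chunk can be shorter than the split length when the text does not divide evenly.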

Split Overlap

Allows for overlapping content between chunks, which can be useful for maintaining context across splits.
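One common way to implement overlap is a sliding window whose step is the split length minus the overlap. The sketch below assumes that approach; the function name `split_with_overlap` and its parameter names are illustrative and not taken from Dynamiq.

```python
def split_with_overlap(units: list[str], split_length: int, split_overlap: int) -> list[list[str]]:
    """Group units into windows of split_length that share split_overlap units."""
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the last window already reaches the end of the input
    return chunks


words = "a b c d e f g".split()
for chunk in split_with_overlap(words, split_length=4, split_overlap=2):
    print(" ".join(chunk))
# a b c d
# c d e f
# e f g
```

With a split length of 4 and an overlap of 2, each chunk repeats the last two words of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.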

How to Use the Document Splitter

1. Input

Provide the documents to be split. The splitter will process these documents and divide them according to the selected options.

2. Configuration

Choose the appropriate split by option, set the split length, and determine any overlap needed. These settings will depend on the nature of your documents and the level of detail required for retrieval.
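The three settings can be thought of as one small configuration object. The sketch below is a hypothetical representation with basic sanity checks: the class name `SplitterSettings`, the field names, and the validation rules are assumptions made for illustration and do not describe how the builder stores its configuration.

```python
from dataclasses import dataclass

# Split by options as listed in this guide.
SPLIT_BY_OPTIONS = {"character", "word", "sentence", "page", "passage", "title"}


@dataclass(frozen=True)
class SplitterSettings:
    """Hypothetical container for the three splitter settings."""
    split_by: str
    split_length: int
    split_overlap: int

    def __post_init__(self) -> None:
        if self.split_by not in SPLIT_BY_OPTIONS:
            raise ValueError(f"split_by must be one of {sorted(SPLIT_BY_OPTIONS)}")
        if self.split_length <= 0:
            raise ValueError("split_length must be a positive integer")
        if not 0 <= self.split_overlap < self.split_length:
            # An overlap as large as the chunk itself would never move forward.
            raise ValueError("split_overlap must be >= 0 and smaller than split_length")


# Example values only; suitable numbers depend on your documents.
settings = SplitterSettings(split_by="word", split_length=200, split_overlap=20)
print(settings)
```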

3. Output

The splitter outputs the divided documents, each tagged with metadata that includes information about the original document. This metadata is crucial for maintaining context and ensuring accurate retrieval during the inference phase.
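The following sketch shows one way split documents could carry metadata about their source. The `Document` class, the `split_document` function, and the metadata keys `source_id` and `chunk_index` are all hypothetical names chosen for this example; the actual fields produced by the node may differ.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Document:
    """Hypothetical document type: text plus free-form metadata."""
    content: str
    metadata: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid4().hex)


def split_document(doc: Document, split_length: int) -> list[Document]:
    """Split a document by words and tag every chunk with its origin."""
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    words = doc.content.split()
    chunks = []
    for index, start in enumerate(range(0, len(words), split_length)):
        chunk_metadata = {
            **doc.metadata,          # keep the original metadata
            "source_id": doc.id,     # link back to the original document
            "chunk_index": index,    # position of the chunk in the source
        }
        text = " ".join(words[start:start + split_length])
        chunks.append(Document(content=text, metadata=chunk_metadata))
    return chunks


report = Document("one two three four five", metadata={"file_name": "report.txt"})
for chunk in split_document(report, split_length=2):
    print(chunk.content, chunk.metadata["chunk_index"], chunk.metadata["file_name"])
```

Because every chunk copies the source metadata and records its position, a retrieved chunk can be traced back to the document and location it came from.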

Benefits of Document Splitting

  • Improved Retrieval: Smaller, focused chunks allow for more precise retrieval, enhancing the relevance of the information returned.

  • Scalability: Efficiently handles large volumes of data by breaking them into manageable pieces.

  • Context Preservation: Metadata links each chunk back to its original document, so responses built on retrieved chunks stay meaningful.

By effectively utilizing the document splitter, you can optimize your data for retrieval, ensuring that your RAG application delivers accurate and contextually relevant information.

In the next section, we will explore the vectorization process, detailing how to convert text into vector representations for efficient retrieval.