Dynamiq Docs

Pre-processing Nodes


Last updated 1 month ago

Why Pre-processing is Important

Pre-processing is a critical step in the indexing workflow: it transforms raw data into a structured format that later steps can process. Cleaning and organizing the data at this stage improves its quality and consistency, which helps the RAG application retrieve more relevant content and generate more accurate responses.

Pre-processing Options

1. Unstructured Converter

The Unstructured Converter handles a wide variety of file formats, which makes it a good general-purpose choice for pre-processing. Supported formats include, but are not limited to:

  • Text files (TXT)

  • Word documents (DOCX)

  • Spreadsheets (XLSX)

  • Presentations (PPTX)

  • HTML and XML files

  • JSON and CSV files

Key Features

  • Document Creation Mode:

    • One-doc-per-file: Treats each file as a single document, ideal for smaller files.

    • One-doc-per-page: Treats each page as a separate document, useful for large documents.

    • One-doc-per-element: Treats each element (e.g., paragraph, table) as a separate document, providing granular control.

  • Converting Strategy:

    • Auto: Automatically selects the best strategy based on the file type and content.

    • Fast: Prioritizes speed, suitable for quick processing needs.

    • Hi_res: Prioritizes detailed and accurate extraction over speed.

    • Ocr_only: Utilizes Optical Character Recognition (OCR) for text extraction from images and scanned documents.
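The three document creation modes above differ only in how extracted content is grouped into documents. The following is a minimal sketch of that grouping in plain Python, not the converter's actual implementation: the `Element` class, the `build_documents` function, and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Element:
    """One extracted piece of a file (hypothetical structure for illustration)."""
    page: int
    kind: str  # e.g. "paragraph" or "table"
    text: str


def build_documents(elements: list[Element], mode: str) -> list[str]:
    """Group extracted elements into document texts according to the mode."""
    if mode == "one-doc-per-file":
        # Everything in the file becomes a single document.
        return ["\n".join(e.text for e in elements)]
    if mode == "one-doc-per-page":
        # Elements that share a page number are merged into one document.
        ordered = sorted(elements, key=lambda e: e.page)
        return [
            "\n".join(e.text for e in group)
            for _, group in groupby(ordered, key=lambda e: e.page)
        ]
    if mode == "one-doc-per-element":
        # Every element stands alone as its own document.
        return [e.text for e in elements]
    raise ValueError(f"unknown mode: {mode}")


# Sample data: two elements on page 1, one element on page 2.
elements = [
    Element(page=1, kind="paragraph", text="Intro"),
    Element(page=1, kind="table", text="Q1 | Q2"),
    Element(page=2, kind="paragraph", text="Summary"),
]

for mode in ("one-doc-per-file", "one-doc-per-page", "one-doc-per-element"):
    print(mode, len(build_documents(elements, mode)))
```

For this sample, the same file yields one, two, or three documents depending on the mode, which is the trade-off between broad context per document and granular control.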

2. LLM PDF Converter

The LLM PDF Converter extracts text from PDF files using a language model, which can improve extraction accuracy.

Key Features

  • LLM Selection: Choose a language model from providers such as OpenAI, Anthropic, Cohere, and more, depending on your needs and preferences.

  • Document Creation Mode: Offers the same flexible options as the Unstructured Converter, allowing for tailored document handling.

  • Extraction Instruction: Customize the extraction process with specific instructions to ensure the desired output format and content.

3. LLM Image Converter

The LLM Image Converter excels at extracting text from images, making it ideal for processing scanned documents, photographs, and other image-based content.

Key Features

  • LLM Selection: As with the LLM PDF Converter, choose the language model used for text extraction.

  • Document Creation Mode: Provides flexible document handling options to suit various image types and content structures.

  • Extraction Instruction: Tailor the extraction process with detailed instructions to achieve precise results.

4. PDF File Converter

The PDF File Converter extracts text from PDF files. Unlike the LLM PDF Converter, it has no LLM selection or extraction instruction; instead, it offers a choice of extraction mode.

Key Features

  • Document Creation Mode:

    • One-doc-per-file: Treats each file as a single document, ideal for smaller files.

    • One-doc-per-page: Treats each page as a separate document, useful for large documents.

  • Extraction Mode:

    • Plain: Extracts raw text without preserving layout or formatting; ideal for simple content parsing.

    • Layout: Extracts text while preserving original layout and positioning; suitable for structured documents.
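To make the difference between the two extraction modes concrete, here is a small self-contained sketch. It does not use the converter itself: the `Fragment` class and both functions are hypothetical, and the positions are simplified to a row and a character column. It only illustrates why layout-preserving output keeps columns aligned while plain output does not.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with a position on the page (hypothetical structure)."""
    row: int  # line index from the top of the page
    col: int  # character column where the text starts
    text: str


def extract_plain(fragments: list[Fragment]) -> str:
    """Return the text in reading order, ignoring horizontal position."""
    ordered = sorted(fragments, key=lambda f: (f.row, f.col))
    return " ".join(f.text for f in ordered)


def extract_layout(fragments: list[Fragment]) -> str:
    """Return the text with each fragment padded out to its column."""
    lines: dict[int, str] = {}
    for f in sorted(fragments, key=lambda f: (f.row, f.col)):
        line = lines.get(f.row, "")
        # Pad with spaces so the fragment starts at its original column.
        lines[f.row] = line.ljust(f.col) + f.text
    last_row = max(lines, default=-1)
    return "\n".join(lines.get(r, "") for r in range(last_row + 1))


# A two-column price list, as it might be positioned on a page.
fragments = [
    Fragment(row=0, col=0, text="Item"),
    Fragment(row=0, col=12, text="Price"),
    Fragment(row=1, col=0, text="Widget"),
    Fragment(row=1, col=12, text="9.99"),
]

print(extract_plain(fragments))
print(extract_layout(fragments))
```

In this sketch the plain output is a single run of words, which is easy to parse but loses the table structure, while the layout output keeps "Price" and "9.99" in the same column.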

5. PPTX File Converter

The PPTX File Converter is designed to extract text content from PowerPoint presentations, enabling efficient processing of slides used in business and research contexts.

Key Features

  • Document Creation Mode:

    • One-doc-per-file: Treats the entire presentation as a single document, ideal for concise slide decks.

    • One-doc-per-slide: Treats each slide as a separate document, useful for detailed analysis or large presentations.

Choosing the Right Pre-processing Tool

The right pre-processing tool depends on the nature and format of your data. The Unstructured Converter suits diverse file types, offering broad compatibility and flexibility. For PDF and image files, the LLM PDF Converter and LLM Image Converter provide LLM-based text extraction that can be guided with extraction instructions. The PDF File Converter offers plain or layout-preserving extraction from PDFs, and the PPTX File Converter extracts text from PowerPoint presentations, per file or per slide.
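One way to think about this choice is as a routing rule keyed on file type. The sketch below is an assumption, not a platform feature: the mapping, the `choose_converter` function, and the `use_llm` flag are hypothetical, and the returned strings are simply the converter names used on this page. Adjust the policy to your own data.

```python
from pathlib import Path

# Hypothetical routing policy: file extension -> converter name from this page.
CONVERTER_BY_EXTENSION = {
    ".pdf": "PDF File Converter",
    ".pptx": "PPTX File Converter",
    ".png": "LLM Image Converter",
    ".jpg": "LLM Image Converter",
    ".jpeg": "LLM Image Converter",
}


def choose_converter(filename: str, use_llm: bool = False) -> str:
    """Pick a converter name for a file (illustrative policy only)."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf" and use_llm:
        # Assumption: opt in to LLM-based extraction for PDFs when requested.
        return "LLM PDF Converter"
    # Anything not listed falls back to the general-purpose converter.
    return CONVERTER_BY_EXTENSION.get(ext, "Unstructured Converter")


for name in ("report.pdf", "scan.jpeg", "deck.pptx", "notes.docx"):
    print(name, "->", choose_converter(name))
```

Falling back to the Unstructured Converter reflects its broad format support; whether a PDF should go through the LLM PDF Converter or the PDF File Converter is left as an explicit flag because it depends on the documents and on accuracy needs.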

By effectively configuring these tools, you can ensure that your data is well-prepared for the subsequent steps in the indexing workflow, ultimately enhancing the performance of your RAG application.

In the next section, we will explore the chunking process, detailing how to split documents into manageable pieces.