Pre-processing Nodes

Why Pre-processing is Important

Pre-processing is a critical step in the indexing workflow: it converts raw files into structured documents that the later steps can consume. Cleaning and organizing the data at this stage improves its quality and consistency, which in turn helps the RAG application retrieve relevant content and generate accurate responses.

Pre-processing Options

1. Unstructured Converter

The Unstructured Converter handles a wide variety of file formats, which makes it a good general-purpose choice for the pre-processing stage. Supported formats include, but are not limited to:

- Text files (TXT)
- Word documents (DOCX)
- Spreadsheets (XLSX)
- Presentations (PPTX)
- HTML and XML files
- JSON and CSV files

Key Features

Document Creation Mode:
- One-doc-per-file: Treats each file as a single document, ideal for smaller files.
- One-doc-per-page: Treats each page as a separate document, useful for large documents.
- One-doc-per-element: Treats each element (e.g., paragraph, table) as a separate document, providing granular control.
Converting Strategy:
- Auto: Automatically selects the best strategy based on the file type and content.
- Fast: Prioritizes speed, suitable for quick processing needs.
- Hi_res: Focuses on high-resolution conversion, ensuring detailed and accurate extraction.
- Ocr_only: Utilizes Optical Character Recognition (OCR) for text extraction from images and scanned documents.
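
The three document creation modes above differ only in how extracted content is grouped into documents. The following is a minimal sketch of that grouping, not the converter's actual implementation: the `Element` class, the `create_documents` function, and the lowercase mode strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Element:
    """One extracted piece of a file (hypothetical shape, for illustration)."""
    page: int
    kind: str  # e.g. "paragraph" or "table"
    text: str


def create_documents(elements, mode):
    """Group one file's elements into document texts according to the mode."""
    if mode == "one-doc-per-file":
        # Everything in the file becomes a single document.
        return ["\n".join(e.text for e in elements)]
    if mode == "one-doc-per-page":
        # sorted() is stable, so reading order within a page is preserved.
        ordered = sorted(elements, key=lambda e: e.page)
        return [
            "\n".join(e.text for e in group)
            for _, group in groupby(ordered, key=lambda e: e.page)
        ]
    if mode == "one-doc-per-element":
        # Every paragraph, table, etc. becomes its own document.
        return [e.text for e in elements]
    raise ValueError(f"unknown mode: {mode!r}")


elements = [
    Element(1, "paragraph", "Intro"),
    Element(1, "table", "a | b"),
    Element(2, "paragraph", "Summary"),
]

for mode in ("one-doc-per-file", "one-doc-per-page", "one-doc-per-element"):
    print(mode, len(create_documents(elements, mode)))
```

For this sample input the three modes yield one, two, and three documents respectively, which shows the trade-off: fewer, larger documents versus more, finer-grained ones.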

2. LLM PDF Converter

The LLM PDF Converter extracts text from PDF files, a common format in many industries, using a large language model (LLM) to perform the extraction.

Key Features

LLM Selection: Choose a language model from providers such as OpenAI, Anthropic, Cohere, and others, depending on your needs and preferences.
Document Creation Mode: Offers the same options as the Unstructured Converter, so you can control how extracted content is grouped into documents.
Extraction Instruction: Customize the extraction process with specific instructions that describe the desired output format and content.
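
As a rough sketch of how these three settings fit together, the example below models them as a small validated configuration object and turns the extraction instruction into a per-page prompt. All names here (`LLMConverterSettings`, `build_prompt`, the mode strings, and the prompt wording) are assumptions made for illustration; the product's real configuration format and prompts are not documented in this section.

```python
from dataclasses import dataclass

# Mode names mirror the Unstructured Converter's options (spelling is assumed).
ALLOWED_MODES = {"one-doc-per-file", "one-doc-per-page", "one-doc-per-element"}


@dataclass(frozen=True)
class LLMConverterSettings:
    """Hypothetical container for the three settings described above."""
    provider: str
    mode: str = "one-doc-per-page"
    instruction: str = "Extract all text exactly as written."

    def __post_init__(self):
        if self.mode not in ALLOWED_MODES:
            raise ValueError(f"unsupported document creation mode: {self.mode!r}")
        if not self.instruction.strip():
            raise ValueError("extraction instruction must not be empty")


def build_prompt(settings, page_number):
    """Combine the user's instruction with a fixed, illustrative suffix."""
    return (
        f"{settings.instruction.strip()}\n\n"
        f"Return only the extracted content of page {page_number}."
    )


settings = LLMConverterSettings(
    provider="OpenAI",
    instruction="Render tables as Markdown and skip page headers.",
)
print(build_prompt(settings, 3))
```

Validating the settings up front means a misspelled mode or an empty instruction fails immediately rather than partway through a large batch of files.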

3. LLM Image Converter

The LLM Image Converter excels at extracting text from images, making it ideal for processing scanned documents, photographs, and other image-based content.

Key Features

LLM Selection: As with the LLM PDF Converter, choose the language model that performs the text extraction.
Document Creation Mode: Provides flexible document handling options to suit various image types and content structures.
Extraction Instruction: Tailor the extraction process with detailed instructions to achieve precise results.

4. PDF File Converter

The PDF File Converter also extracts text from PDF files. Its options are limited to a document creation mode and an extraction mode, with no LLM selection or extraction instruction.

Key Features

Document Creation Mode:
- One-doc-per-file: Treats each file as a single document, ideal for smaller files.
- One-doc-per-page: Treats each page as a separate document, useful for large documents.
Extraction Mode:
- Plain: Extracts raw text without preserving layout or formatting. Ideal for simple content parsing.
- Layout: Extracts text while preserving the original layout and positioning. Suitable for structured documents.
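
The difference between the two extraction modes can be illustrated with a toy example. The sketch below assumes text fragments arrive as hypothetical `(line, column, text)` tuples; it is not the converter's actual algorithm, only a demonstration of why layout mode keeps tabular content readable.

```python
from collections import defaultdict

# Hypothetical fragments as a PDF parser might report them: (line, column, text).
fragments = [
    (0, 0, "Item"), (0, 12, "Price"),
    (1, 0, "Widget"), (1, 12, "4.50"),
]


def extract_plain(fragments):
    """Join fragments in reading order with single spaces, ignoring position."""
    ordered = sorted(fragments, key=lambda f: (f[0], f[1]))
    return " ".join(text for _, _, text in ordered)


def extract_layout(fragments):
    """Rebuild each line, padding with spaces so that each fragment starts
    at its column whenever the preceding text leaves room for it."""
    lines = defaultdict(list)
    for line, col, text in fragments:
        lines[line].append((col, text))
    rows = []
    for line in sorted(lines):
        row = ""
        for col, text in sorted(lines[line]):
            row = row.ljust(col) + text
        rows.append(row)
    return "\n".join(rows)


print(extract_plain(fragments))
print(extract_layout(fragments))
```

Plain mode flattens the table into one run of words, while layout mode keeps the two columns aligned, which matters when column position carries meaning.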

5. PPTX File Converter

The PPTX File Converter is designed to extract text content from PowerPoint presentations, enabling efficient processing of slides used in business and research contexts.

Key Features

Document Creation Mode:
- One-doc-per-file: Treats the entire presentation as a single document, ideal for concise slide decks.
- One-doc-per-slide: Treats each slide as a separate document, useful for detailed analysis or large presentations.

Choosing the Right Pre-processing Tool

Selecting the appropriate pre-processing tool depends on the nature and format of your data. The Unstructured Converter suits mixed collections of file types, offering broad compatibility and flexibility. For PDF and image files, the LLM PDF Converter and LLM Image Converter provide model-based extraction that you can steer with custom instructions. The PDF File Converter and PPTX File Converter cover PDFs and presentations when plain or layout-preserving extraction, or one document per slide, is sufficient.
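
One way to encode such a selection policy is a small routing function keyed on file extension. This is only a sketch of one possible policy: the function name `choose_converter`, the `prefer_llm` flag, and the list of image extensions are assumptions, and a real pipeline might route files differently.

```python
from pathlib import Path

# Image extensions are an assumption; adjust to the formats you actually ingest.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_converter(filename, prefer_llm=False):
    """Return the name of a converter for the file, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "LLM Image Converter"
    if ext == ".pdf":
        # prefer_llm is a hypothetical switch, e.g. for scanned or complex PDFs.
        return "LLM PDF Converter" if prefer_llm else "PDF File Converter"
    if ext == ".pptx":
        return "PPTX File Converter"
    # Everything else falls back to the general-purpose converter.
    return "Unstructured Converter"


for name in ("report.pdf", "scan.PNG", "deck.pptx", "notes.docx"):
    print(name, "->", choose_converter(name))
```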

By effectively configuring these tools, you can ensure that your data is well-prepared for the subsequent steps in the indexing workflow, ultimately enhancing the performance of your RAG application.

In the next section, we will explore the chunking process, detailing how to split documents into manageable pieces.

