Pre-processing Nodes
Pre-processing is a critical step in the indexing workflow: it transforms raw files into a structured format that later steps can work with. Cleaning and organizing the data at this stage improves its quality and consistency, which in turn helps the retrieval-augmented generation (RAG) application retrieve relevant content and generate more accurate responses.
The Unstructured Converter handles a wide variety of file formats, which makes it the general-purpose option in the pre-processing stage. Supported formats include, but are not limited to:
Text files (TXT)
Word documents (DOCX)
Spreadsheets (XLSX)
Presentations (PPTX)
HTML and XML files
JSON and CSV files
Document Creation Mode:
One-doc-per-file: Treats each file as a single document, ideal for smaller files.
One-doc-per-page: Treats each page as a separate document, useful for large documents.
One-doc-per-element: Treats each element (e.g., paragraph, table) as a separate document, providing granular control.
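The difference between the three modes can be illustrated with a minimal sketch. It assumes the converter has already produced a flat list of elements, each tagged with its page number. The `Element` type, the `build_documents` function, and the mode strings are hypothetical names chosen for illustration, not the converter's actual API.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Element:
    """Hypothetical extracted element: a paragraph, table, title, etc."""
    text: str
    page: int


def build_documents(elements: list[Element], mode: str) -> list[str]:
    """Group extracted elements into document texts for the given mode."""
    if mode == "one-doc-per-file":
        # Everything in the file becomes a single document.
        return ["\n".join(e.text for e in elements)]
    if mode == "one-doc-per-page":
        # Elements that share a page number are merged into one document.
        # sorted() is stable, so the original order within a page is kept.
        ordered = sorted(elements, key=lambda e: e.page)
        return [
            "\n".join(e.text for e in group)
            for _, group in groupby(ordered, key=lambda e: e.page)
        ]
    if mode == "one-doc-per-element":
        # Each element stands alone as its own document.
        return [e.text for e in elements]
    raise ValueError(f"unknown mode: {mode}")


elements = [
    Element("Title", 1),
    Element("First paragraph.", 1),
    Element("A table on page two.", 2),
]
```

With this sample input, the per-file mode yields one document, the per-page mode yields two, and the per-element mode yields three, which is the trade-off between context and granularity that the modes expose.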
Converting Strategy:
Auto: Automatically selects the best strategy based on the file type and content.
Fast: Prioritizes speed, suitable for quick processing needs.
Hi_res: Prioritizes extraction detail and accuracy over speed, suitable for documents with complex layouts.
Ocr_only: Utilizes Optical Character Recognition (OCR) for text extraction from images and scanned documents.
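How an automatic choice between these strategies might work can be sketched as a small rule set. The rules below are an illustrative assumption, not the converter's actual selection logic. The function `choose_strategy`, its parameters, the lowercase strategy strings, and the list of image extensions are all hypothetical.

```python
from pathlib import Path

# Assumed set of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_strategy(filename: str, has_text_layer: bool = True,
                    needs_layout: bool = False) -> str:
    """Pick a converting strategy using hypothetical 'auto'-style rules."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS or not has_text_layer:
        # No embedded text to read, so fall back to OCR.
        return "ocr_only"
    if needs_layout:
        # Tables or complex layouts benefit from the slower, detailed pass.
        return "hi_res"
    # Plain extractable text: the quick path is enough.
    return "fast"
```

In this sketch a scanned PDF without a text layer and a PNG both route to OCR, while an ordinary DOCX file takes the fast path.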
The LLM PDF Converter is designed specifically for extracting text from PDF files, a common format in many industries. It uses a large language model (LLM) to perform the extraction.
LLM Selection: Choose a model from leading providers such as OpenAI, Anthropic, Cohere, and more, depending on your specific needs and preferences.
Document Creation Mode: Offers the same flexible options as the Unstructured Converter, allowing for tailored document handling.
Extraction Instruction: Customize the extraction process with specific instructions to ensure the desired output format and content.
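The three settings above can be pictured as a small configuration object. This is a sketch under stated assumptions: `PdfConverterConfig`, `build_prompt`, the default instruction text, and the `"example-model"` identifier are hypothetical and do not reflect the node's real configuration schema or any real model name.

```python
from dataclasses import dataclass

# Assumed default instruction, for illustration only.
DEFAULT_INSTRUCTION = "Extract all text from this page as plain text."


@dataclass
class PdfConverterConfig:
    """Hypothetical settings mirroring the three options described above."""
    llm: str  # placeholder model identifier
    document_creation_mode: str = "one-doc-per-page"
    extraction_instruction: str = DEFAULT_INSTRUCTION


def build_prompt(config: PdfConverterConfig, page_number: int) -> str:
    """Compose the instruction text sent alongside one rendered page."""
    return f"{config.extraction_instruction}\n(Page {page_number})"


config = PdfConverterConfig(
    llm="example-model",
    extraction_instruction=(
        "Return the page as Markdown and keep tables as Markdown tables."
    ),
)
prompt = build_prompt(config, page_number=3)
```

A specific instruction like the one shown tends to be more useful than a generic one, because it states the output format the downstream chunking step should expect.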
The LLM Image Converter excels at extracting text from images, making it ideal for processing scanned documents, photographs, and other image-based content.
LLM Selection: As with the LLM PDF Converter, choose a model from the supported providers to perform the text extraction.
Document Creation Mode: Provides flexible document handling options to suit various image types and content structures.
Extraction Instruction: Tailor the extraction process with detailed instructions to achieve precise results.
Selecting the appropriate pre-processing tool depends on the nature and format of your data. The Unstructured Converter is ideal for diverse file types, offering broad compatibility and flexibility. For PDF and image files, the LLM PDF Converter and LLM Image Converter provide specialized capabilities, ensuring accurate and efficient text extraction.
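The selection guidance above can be sketched as a simple routing function keyed on file extension. The function name `pick_converter` and the set of image extensions are assumptions made for illustration. In practice the choice may also depend on content, for example whether a PDF is scanned.

```python
from pathlib import Path

# Assumed image extensions; adjust to the formats your workflow handles.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def pick_converter(filename: str) -> str:
    """Return the name of the converter suggested for this file."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "LLM PDF Converter"
    if suffix in IMAGE_EXTENSIONS:
        return "LLM Image Converter"
    # Everything else (DOCX, XLSX, PPTX, HTML, JSON, CSV, ...) goes
    # to the general-purpose converter.
    return "Unstructured Converter"
```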
By effectively configuring these tools, you can ensure that your data is well-prepared for the subsequent steps in the indexing workflow, ultimately enhancing the performance of your RAG application.
In the next section, we will explore the chunking process, detailing how to split documents into manageable pieces.