Document Splitting
Document splitting, or chunking, is a vital step in the indexing workflow. It breaks large documents into smaller, manageable pieces, which improves the efficiency and accuracy of retrieval because the system can focus on the sections of a document that are relevant to a query. Each chunk keeps metadata about its original document, so context is preserved and the retrieved information remains meaningful and coherent.
The document splitter node is designed to handle various splitting strategies, providing flexibility in how documents are divided. It receives documents as input and outputs the split documents, while preserving metadata about the original document.
Character: Splits the document based on a specified number of characters.
Word: Divides the document by a set number of words.
Sentence: Splits the document into individual sentences.
Page: Breaks the document into pages, useful for paginated content.
Passage: Divides the document into logical passages or sections.
Title: Splits based on titles or headings, ideal for structured documents.
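The split by options above can be sketched in plain Python. This is a minimal illustration, not the node's actual implementation: the function name `split_into_units` is hypothetical, the sentence rule is deliberately naive, and the page and passage separators (a form feed and a blank line) are assumptions. Title-based splitting is left out because it depends on how headings are marked up in the source format.

```python
import re


def split_into_units(text: str, split_by: str) -> list[str]:
    """Break text into basic units for one strategy (illustrative sketch only)."""
    if split_by == "character":
        return list(text)
    if split_by == "word":
        return text.split()
    if split_by == "sentence":
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if split_by == "page":
        # Assumption: pages are separated by a form feed character.
        return [p for p in text.split("\f") if p.strip()]
    if split_by == "passage":
        # Assumption: passages are separated by one or more blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unsupported split_by: {split_by!r}")
```

A production splitter would typically use a proper sentence tokenizer, since abbreviations such as "e.g." defeat the simple punctuation rule used here.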
Split Length: Defines the size of each chunk, measured in the selected split by unit. For example, if splitting by characters, you specify the number of characters per chunk.
Split Overlap: Sets how much content is shared between consecutive chunks, which helps maintain context across split boundaries.
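To make the interaction between chunk length and overlap concrete, here is a minimal sliding-window sketch. The function name `window_units` and its parameter names are hypothetical and chosen for illustration; the node itself may handle edge cases differently.

```python
def window_units(
    units: list[str], split_length: int, split_overlap: int = 0
) -> list[list[str]]:
    """Group units into chunks of split_length, sharing split_overlap units."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and < split_length")

    step = split_length - split_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the last unit is covered; avoid a redundant tail chunk
    return chunks


words = "a b c d e f g".split()
print([" ".join(c) for c in window_units(words, split_length=3, split_overlap=1)])
# ['a b c', 'c d e', 'e f g']
```

Each chunk starts `split_length - split_overlap` units after the previous one, so a larger overlap produces more chunks and a larger index for the same document.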
Provide the documents to be split. The splitter will process these documents and divide them according to the selected options.
Choose the appropriate split by option, set the split length, and determine any overlap needed. These settings will depend on the nature of your documents and the level of detail required for retrieval.
The splitter outputs the divided documents, each tagged with metadata that includes information about the original document. This metadata is crucial for maintaining context and ensuring accurate retrieval during the inference phase.
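The input, configure, and output steps above can be sketched end to end as follows. The `Document` class, the `split_with_metadata` function, and the metadata keys `chunk_index` and `start_word` are assumptions made for this example; the fields the node actually records may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: text plus free-form metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def split_with_metadata(doc: Document, split_length: int) -> list[Document]:
    """Split a document by words; each chunk inherits the source metadata."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    words = doc.content.split()
    chunks = []
    for index, start in enumerate(range(0, len(words), split_length)):
        chunk_meta = dict(doc.meta)        # copy, so the original is untouched
        chunk_meta["chunk_index"] = index  # position of the chunk in the source
        chunk_meta["start_word"] = start   # word offset in the original text
        text = " ".join(words[start:start + split_length])
        chunks.append(Document(content=text, meta=chunk_meta))
    return chunks


doc = Document("one two three four five", meta={"source": "guide.txt"})
for chunk in split_with_metadata(doc, split_length=2):
    print(chunk.content, chunk.meta)
```

Because every chunk carries the source identifier and its position, a retrieved chunk can be traced back to, and reassembled with, the document it came from.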
Improved Retrieval: Smaller, focused chunks allow for more precise retrieval, enhancing the relevance of the information returned.
Scalability: Efficiently handles large volumes of data by breaking them into manageable pieces.
Context Preservation: Metadata ties each chunk back to its original document, so retrieved chunks keep their context and support meaningful responses.
By effectively utilizing the document splitter, you can optimize your data for retrieval, ensuring that your RAG application delivers accurate and contextually relevant information.
In the next section, we will explore the vectorization process, detailing how to convert text into vector representations for efficient retrieval.