Document Writers
In the indexing workflow of a Retrieval-Augmented Generation (RAG) application, document writers store vectorized documents in a vector database so that they can be retrieved during the inference phase. Writing the data to a named index in a structured form lets the retrieval step find relevant content quickly, which improves both the speed and the accuracy of the application.
Dynamiq provides document writers for Weaviate, Pinecone, Chroma, Qdrant, Milvus, Elasticsearch, and pgvector. The sections below describe the configuration of each.
Weaviate Writer:
Configuration
Name: Define a unique name for the writer to identify it within your workflow.
Connection: Establish a connection to Weaviate, a vector database optimized for storing and retrieving vectorized data.
Index Name: Specify the index name where the data will be stored. This helps in organizing and retrieving data efficiently.
Options:
Create if not exists: Automatically creates the index if it doesn't already exist, ensuring seamless data storage.
Advanced configuration:
Content Key: Specify the custom field name used to store content in the storage.
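The effect of "Create if not exists" can be illustrated with a minimal Python sketch. `InMemoryVectorStore` and `get_or_create_index` are hypothetical names used only for illustration; they are not part of Dynamiq or the Weaviate client, and the real writer performs this check against the database.

```python
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database, for illustration only."""

    def __init__(self):
        # Maps an index name to the list of records stored in it.
        self.indexes = {}

    def get_or_create_index(self, index_name, create_if_not_exists):
        """Return the index, creating it first only when the option allows it."""
        if index_name not in self.indexes:
            if not create_if_not_exists:
                raise KeyError(f"index {index_name!r} does not exist")
            self.indexes[index_name] = []
        return self.indexes[index_name]


store = InMemoryVectorStore()

# With the option enabled, the first write creates the index.
records = store.get_or_create_index("articles", create_if_not_exists=True)
records.append({"content": "hello", "embedding": [0.1, 0.2]})
```

With the option disabled, writing to a missing index fails immediately, which can be preferable when a typo in the index name should not silently create a new index.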
Pinecone Writer:
Configuration
Name: Assign a name to the writer for easy identification.
Connection: Set up a connection to Pinecone, a scalable vector database service.
Index Name: Enter the index name to organize your data.
Embedding Dimension: The length of each stored vector. Default is 1536. It must match the output dimension of the embedding model used in the vectorization step.
Metric: Choose a metric, e.g., cosine, to determine how similarity is calculated between vectors.
Namespace: Define a namespace to segment data within the index, allowing for better organization.
Batch Size: Set the batch size for data writing, which can optimize performance by processing multiple entries at once.
Options:
Create if not exists: Ensures the index is created if it doesn't exist, facilitating uninterrupted data storage.
Index Type: There are two deployment options:
Serverless: Requires specifying the Cloud URL and Region for optimal data locality and access speed.
Pod: Requires specifying the Environment, Pod Type, and number of Pods for deployment.
Depending on the chosen deployment option, provide the related fields when "Create if not exists" is enabled.
Advanced configuration:
Content Key: Specify the custom field name used to store content in the storage.
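Two of the settings above lend themselves to a short sketch: how a batch size splits the documents into write requests, and which fields each index type needs when "Create if not exists" is enabled. The function names and the setting keys (`cloud_url`, `region`, `environment`, `pod_type`, `pods`) are assumptions chosen to mirror the field labels on this page; they are not Dynamiq or Pinecone identifiers.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Required fields per index type, following the list above (hypothetical keys).
REQUIRED_FIELDS = {
    "serverless": ("cloud_url", "region"),
    "pod": ("environment", "pod_type", "pods"),
}


def missing_index_fields(index_type, settings):
    """Return the required settings that are absent or empty for index_type."""
    return [name for name in REQUIRED_FIELDS[index_type] if not settings.get(name)]


# Five documents with a batch size of 2 result in three write requests.
batches = list(batched(range(5), batch_size=2))

# A serverless configuration that only sets the region is incomplete.
missing = missing_index_fields("serverless", {"region": "us-east-1"})
```

A larger batch size means fewer round trips to the database, at the cost of larger individual requests.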
Chroma Writer:
Configuration
Name: Provide a name for the writer to distinguish it in your setup.
Connection: Connect to Chroma, an open-source vector database.
Index Name: Specify the index name for data storage.
Options:
Create if not exists: Automatically sets up the index if it's not present, ensuring smooth data operations.
Qdrant Writer:
Configuration
Name: Set a name for the writer for easy reference.
Connection: Establish a connection to Qdrant, a high-performance vector database.
Index Name: Enter the index name to categorize your data.
Embedding Dimension: The length of each stored vector. Default is 1536. It must match the output dimension of the embedding model used in the vectorization step.
Metric: Choose a metric, such as cosine, to measure vector similarity.
Options:
Create if not exists: Automatically creates the index if needed, ensuring continuous data flow.
Advanced configuration:
Content Key: Specify the custom field name used to store content in the storage.
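To make the Metric setting concrete, the following sketch computes cosine similarity in plain Python. It only illustrates what the metric measures; the database computes similarity internally, and the function name is chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1,
# orthogonal vectors score 0, and opposite vectors score close to -1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because cosine similarity ignores vector length, it compares direction only. The length check in the function mirrors the requirement that stored vectors and query vectors share the configured embedding dimension.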
Milvus Writer:
Configuration
Name: Set a name for the writer for easy reference.
Connection: Establish a connection to Milvus, a highly performant, scalable vector database.
Index Name: Enter the index name to categorize your data.
Options:
Create if not exists: Automatically creates the index if needed, ensuring continuous data flow.
Advanced configuration:
Content Key: Specify a unique name for the field in the storage used to keep content.
Embedding Key: Specify a unique name for the field in the storage used to keep the vector.
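The Content Key and Embedding Key settings decide under which field names a document is stored. A minimal sketch of that mapping follows; the document shape (`id`, `content`, `embedding`, `metadata`) and the `to_record` helper are assumptions for illustration and do not reproduce Dynamiq's internal document type.

```python
def to_record(document, content_key="content", embedding_key="embedding"):
    """Map a document onto the field names configured for the storage."""
    if content_key == embedding_key:
        raise ValueError("content key and embedding key must differ")
    return {
        "id": document["id"],
        content_key: document["content"],
        embedding_key: document["embedding"],
        "metadata": document.get("metadata", {}),
    }


document = {
    "id": "doc-1",
    "content": "Vector databases store embeddings.",
    "embedding": [0.12, -0.08, 0.33],
}

# Store the text under "text" and the vector under "vector".
record = to_record(document, content_key="text", embedding_key="vector")
```

Custom keys are mainly useful when the writer targets an existing collection whose field names were chosen by another system. The retriever that later reads the data has to use the same keys.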
Elasticsearch Writer:
Configuration
Name: Set a name for the writer for easy reference.
Connection: Establish a connection to Elasticsearch, a distributed search and analytics engine.
Index Name: Enter the index name to categorize your data.
Embedding Dimension: The length of each stored vector. Default is 1536. It must match the output dimension of the embedding model used in the vectorization step.
Similarity: Choose a metric, e.g., cosine, to determine how similarity is calculated between vectors.
Write Batch Size: Defines the number of records processed and written in a single batch.
Options:
Create if not exists: Automatically creates the index if needed.
Advanced configuration:
Content Key: Specify a unique name for the field in the storage used to keep content.
Embedding Key: Specify a unique name for the field in the storage used to keep the vector.
PGvector Writer:
Configuration
Name: Set a name for the writer for easy reference.
Connection: Establish a connection to pgvector, an open-source vector similarity search extension for Postgres.
Index Name: Enter the index name to categorize your data.
Table Name: Enter the name of the table where the vectors will be stored.
Schema Name: Enter the name of the schema in the database.
Embedding Dimension: The length of each stored vector. Default is 1536. It must match the output dimension of the embedding model used in the vectorization step.
Metric: Choose a metric, e.g., cosine, to determine how similarity is calculated between vectors.
Index Method: Choose the indexing approach used for vector search.
Keyword Index Name: Enter the name of the index for keyword-based search.
Options:
Create extension: Enable automatic creation of the pgvector extension.
Create if not exists: Automatically creates the index if needed.
Advanced configuration:
Content Key: Specify a unique name for the field in the storage used to keep content.
Embedding Key: Specify a unique name for the field in the storage used to keep the vector.
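To show how these settings relate to the database, the sketch below builds the kind of SQL that "Create extension" and "Create if not exists" imply, using standard pgvector syntax. This is an assumption about the general shape of the setup, not the statements Dynamiq actually issues; the function name, the metric labels, and the fixed `content` and `embedding` column names are hypothetical.

```python
import re

# pgvector operator classes for the common distance metrics.
OPERATOR_CLASSES = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "inner_product": "vector_ip_ops",
}
INDEX_METHODS = ("hnsw", "ivfflat")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked(name):
    """Reject names that are not plain SQL identifiers."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def setup_statements(schema, table, index, dimension=1536,
                     metric="cosine", method="hnsw", create_extension=True):
    """Return the SQL statements that prepare a table for vector storage."""
    if method not in INDEX_METHODS:
        raise ValueError(f"unknown index method: {method!r}")
    if metric not in OPERATOR_CLASSES:
        raise ValueError(f"unknown metric: {metric!r}")
    target = f"{checked(schema)}.{checked(table)}"
    statements = []
    if create_extension:
        statements.append("CREATE EXTENSION IF NOT EXISTS vector")
    statements.append(
        f"CREATE TABLE IF NOT EXISTS {target} "
        f"(id TEXT PRIMARY KEY, content TEXT, embedding vector({int(dimension)}))"
    )
    statements.append(
        f"CREATE INDEX IF NOT EXISTS {checked(index)} ON {target} "
        f"USING {method} (embedding {OPERATOR_CLASSES[metric]})"
    )
    return statements


statements = setup_statements("public", "documents", "documents_embedding_idx")
```

The Metric and Index Method settings both end up in the index definition, so changing either one for existing data generally means rebuilding the index.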
Input:
Provide the vectorized documents from the previous vectorization step.
Configuration:
Select the appropriate writer based on your storage requirements.
Configure the necessary parameters such as connection, index name, and embedding dimensions.
Output:
The writer stores the vectorized data, making it accessible for retrieval during the inference phase.
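The input, configuration, and output steps above can be sketched end to end with an in-memory writer. `SketchWriter` is a hypothetical class written for this example, not a Dynamiq node; it only demonstrates that every incoming document needs an embedding whose length matches the configured dimension.

```python
from dataclasses import dataclass, field


@dataclass
class SketchWriter:
    """Hypothetical in-memory writer used to illustrate the flow."""

    index_name: str
    dimension: int = 1536
    stored: list = field(default_factory=list)

    def write(self, documents):
        """Validate all documents first, then store them and return the count."""
        for doc in documents:
            embedding = doc.get("embedding")
            if embedding is None:
                raise ValueError(f"document {doc.get('id')!r} has no embedding")
            if len(embedding) != self.dimension:
                raise ValueError(
                    f"document {doc.get('id')!r} has dimension {len(embedding)}, "
                    f"expected {self.dimension}"
                )
        self.stored.extend(documents)
        return len(documents)


# Input: vectorized documents from the previous step (tiny vectors for brevity).
documents = [
    {"id": "a", "content": "first", "embedding": [0.1, 0.2, 0.3]},
    {"id": "b", "content": "second", "embedding": [0.4, 0.5, 0.6]},
]

# Configuration: the dimension matches the embeddings above.
writer = SketchWriter(index_name="articles", dimension=3)

# Output: the number of documents written.
written = writer.write(documents)
```

Validating the whole batch before storing anything keeps the index free of partial writes when one document is malformed.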
Efficient Storage: Optimizes data storage for quick retrieval.
Scalability: Handles large datasets, supporting extensive knowledge bases.
Flexibility: Offers various configurations to suit different storage needs.
With a document writer configured for your vector store, the indexed data is available to the retrieval step, so your RAG application can return precise and contextually relevant information efficiently.