Prepare Dataset for Fine-Tuning
Fine-tuning a model requires a well-structured, labeled dataset. Preparing this dataset is a critical step in the fine-tuning process, as the model learns from the data provided and adapts to the specific task or domain.
Dataset Requirements
Before you start preparing your dataset, make sure it meets the following requirements:
The examples in the dataset should be a diverse set of demonstration conversations: This will help the model learn to generate responses that are relevant and engaging.
The dataset should include conversations similar to the requests the model will receive at inference time in production: When the model is fine-tuned on relevant data, it will be better equipped to generate accurate responses once deployed.
The amount of data should be sufficient to train the model effectively: We recommend starting with at least 50-100 representative examples in your dataset and evaluating whether fine-tuning yields sufficient improvement. Overall, the more high-quality data you have, the better the model will perform after fine-tuning.
Dataset Structure
Currently, Dynamiq supports the OpenAI-compatible JSONL format for fine-tuning datasets. This structure enables context management in conversations, allowing the model not only to interpret what the user is asking but also to maintain continuity in responses.
Each line of the JSONL file is a JSON object that contains a top-level messages array, where each item in the array is an individual message object. Each message object includes:
role: Defines the role of the participant in the conversation. The standard roles are:
- system: Sets the behavior or persona of the model (e.g., "You are a helpful assistant").
- user: Represents the input or question from the user.
- assistant: Represents the model's response.
content: Contains the actual message or instruction given by each role.
Here’s an example of how your dataset could look in the JSONL format:
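```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Open the login page, click \"Forgot password\", and follow the emailed instructions."}]}
```

The conversations above are placeholders; substitute demonstration conversations from your own task or domain. Before uploading, it can also be worth sanity-checking that every line parses and follows this schema. Below is a minimal sketch (the file name dataset.jsonl and the validate_jsonl helper are illustrative assumptions, not part of Dynamiq's API):

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Check that every line is valid JSON and follows the messages schema."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # raises a ValueError on malformed JSON
            for message in record["messages"]:
                assert message["role"] in ALLOWED_ROLES, f"line {lineno}: unexpected role {message['role']!r}"
                assert isinstance(message["content"], str), f"line {lineno}: content must be a string"

validate_jsonl("dataset.jsonl")
```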
Train/Validation/Test Split
When preparing your dataset, it's essential to split it into training, validation, and test sets. This division helps evaluate the model's performance during training and fine-tuning, ensuring that it generalizes well to unseen data.
The recommended split ratio is typically 70/10/20, 70/20/10, or 80/10/10 – usually, most of the data is allocated to the train set, and a smaller percentage is left for the validation/test splits. Such a distribution allows the model to learn from a large portion of the data while validating its performance on a smaller subset and testing it on unseen examples.
Dynamiq automatically applies a 70/10/20 split for training, validation, and test sets respectively when you upload your dataset for fine-tuning.
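Dynamiq handles this split for you on upload, but if you want to inspect or reproduce a comparable split locally, you can pre-split the file yourself. A minimal sketch, assuming the dataset lives in dataset.jsonl and using the default 70/10/20 ratio:

```python
import json
import random

def split_jsonl(path, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle a JSONL dataset and write train/validation/test files."""
    with open(path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    random.Random(seed).shuffle(examples)  # reproducible shuffle before splitting

    n_train = int(len(examples) * train_frac)
    n_val = int(len(examples) * val_frac)
    splits = {
        "train": examples[:n_train],
        "validation": examples[n_train:n_train + n_val],
        "test": examples[n_train + n_val:],  # remainder, ~20% with the defaults
    }

    for name, rows in splits.items():
        with open(f"{name}.jsonl", "w", encoding="utf-8") as out:
            for row in rows:
                out.write(json.dumps(row, ensure_ascii=False) + "\n")

split_jsonl("dataset.jsonl")
```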