Preparing Data

Prepare Dataset for Fine-Tuning

Fine-tuning a model requires a well-structured, labeled dataset. Preparing that dataset is a critical step in the fine-tuning process, because the model learns from the data you provide and adapts to the specific task or domain.

Dataset Requirements

Before you start preparing your dataset, make sure it meets the following requirements:

  1. The examples in the dataset should be a diverse set of demonstration conversations: This will help the model learn to generate responses that are relevant and engaging.

  2. The dataset should include conversations similar to the requests the model will receive at inference time in production: A model fine-tuned on relevant data is better equipped to generate accurate responses once you deploy it.

  3. The amount of data should be sufficient to train the model effectively: We recommend starting with at least 50-100 representative examples, then checking how the model performs and whether fine-tuning yields sufficient improvement. In general, the more high-quality data you have, the better the model will perform after fine-tuning.

Dataset Structure

Currently, Dynamiq supports the OpenAI-compatible JSONL format for fine-tuning datasets. This structure preserves conversational context, allowing the model not only to interpret what the user is asking but also to maintain continuity across responses.

Each line of the JSONL file is a JSON object with a top-level messages array, where each item in the array is an individual message object. Each message object includes:

  • role: Defines the role of the participant in the conversation. The standard roles are:

    • system: Sets the behavior or persona of the model (e.g., "You are a helpful assistant").

    • user: Represents the input or question from the user.

    • assistant: Represents the model's response.

  • content: Contains the actual message or instruction given by each role.
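As a minimal sketch of the structure described above, one training example can be built as a Python dictionary and serialized to a single JSONL line. The helper names make_record and to_jsonl_line are hypothetical and used only for illustration; they are not part of Dynamiq.

```python
import json


def make_record(system_prompt, user_message, assistant_reply):
    """Build one training example with a top-level "messages" list."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_reply},
        ]
    }


def to_jsonl_line(record):
    # json.dumps escapes newlines inside strings, so each record
    # stays on exactly one line of the JSONL file.
    return json.dumps(record, ensure_ascii=False)


record = make_record(
    "You are an assistant that creates engaging and compelling product descriptions for online listings.",
    "Describe a pop-up tent.",
    "Set up camp effortlessly with this convenient pop-up tent.",
)
line = to_jsonl_line(record)
print(line)
```

Writing one such line per example, each terminated by a newline, produces a file in the JSONL layout.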

Here’s an example of how your dataset could look in the JSONL format:

{"messages":[{"role":"system","content":"You are an assistant that creates engaging and compelling product descriptions for online listings."},{"role":"user","content":"Describe a pop-up tent."},{"role":"assistant","content":"Set up camp effortlessly with this convenient pop-up tent. Designed for quick assembly, it opens in seconds and provides ample space for comfortable sleeping or relaxation. With a durable, weather-resistant fabric and mesh windows for ventilation, it’s perfect for camping trips or beach outings. Lightweight and portable, it’s the ideal solution for hassle-free outdoor adventures."}]}
{"messages":[{"role":"system","content":"You are an assistant that creates engaging and compelling product descriptions for online listings."},{"role":"user","content":"Describe a car phone mount."},{"role":"assistant","content":"Keep your phone secure and accessible with this versatile car phone mount. Featuring an adjustable grip and 360-degree rotation, it ensures a perfect view of your device while driving. Its strong suction cup or adhesive base provides a stable hold on various surfaces, making it easy to navigate, take calls, and use GPS hands-free."}]}
{"messages":[{"role":"system","content":"You are an assistant that creates engaging and compelling product descriptions for online listings."},{"role":"user","content":"Describe a heated blanket."},{"role":"assistant","content":"Stay cozy and warm during chilly nights with this luxurious heated blanket. Equipped with adjustable heat settings, it allows you to customize your comfort level with ease. Made from ultra-soft, plush fabric, it provides even heat distribution and features an auto-off function for safety. Perfect for adding warmth to your bed or couch, it’s a must-have for ultimate relaxation."}]}
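Before uploading, it can help to check that every line follows this structure. The following is a rough sketch of such a check, not Dynamiq's own validation: the function name validate_jsonl_lines is hypothetical, and the rules it applies (only the three standard roles, string content, blank lines skipped) are assumptions drawn from the description above.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # assumption: blank lines are ignored rather than reported
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, 'missing or empty "messages" array'))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not an object"))
            elif message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            elif not isinstance(message.get("content"), str):
                problems.append((number, "content is not a string"))
    return problems


sample = [
    '{"messages": [{"role": "user", "content": "Describe a pop-up tent."}, {"role": "assistant", "content": "..."}]}',
    '{"messages": [{"role": "bot", "content": "hi"}]}',
    "not json",
]
for line_number, problem in validate_jsonl_lines(sample):
    print(line_number, problem)
```

To check a real file, pass the open file object instead of the sample list, for example `validate_jsonl_lines(open("dataset.jsonl", encoding="utf-8"))`, where dataset.jsonl is a placeholder path.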

Train/Validation/Test Split

When preparing your dataset, it's essential to split it into training, validation, and test sets. This division helps evaluate the model's performance during training and fine-tuning, ensuring that it generalizes well to unseen data.

Commonly recommended split ratios are 70/10/20, 70/20/10, and 80/10/10 (train/validation/test). Most of the data goes to the training set, and smaller portions are held out for validation and testing. This distribution lets the model learn from a large portion of the data while validating its performance on a smaller subset and testing it on unseen examples.

Dynamiq automatically applies a 70/10/20 split for the training, validation, and test sets, respectively, when you upload your dataset for fine-tuning.
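To get a feel for how many examples each set receives, the sketch below applies a 70/10/20 split locally. It only illustrates the arithmetic and is not how Dynamiq performs the split internally: the function name split_dataset, the shuffling step, and the fixed seed are all assumptions.

```python
import random


def split_dataset(examples, percents=(70, 10, 20), seed=42):
    """Shuffle a copy of the examples and cut it into train/validation/test lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    total = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = total * percents[0] // 100
    n_val = total * percents[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the test set takes whatever remains
    return train, val, test


examples = [f"example-{i}" for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

With very small datasets, the validation set can end up with only a handful of examples, which is one more reason to start from at least 50-100 examples.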
