# LLMs

The Dynamiq platform allows you to deploy and fine-tune open-source large language models (LLMs) such as Meta’s Llama. Follow these steps to deploy an LLM and make it accessible via API.

<figure><img src="/files/6I1PPQWIu2F6JDxi9i7A" alt=""><figcaption></figcaption></figure>

1. Start a new deployment:
   * Open the **Deployments** section on the Dynamiq platform dashboard.
   * Click on **Add new deployment** in the upper-right corner.
   * Choose **LLM** from the list of deployment types.
2. Configure the LLM deployment:
   * **Name**: Enter a unique name for the deployment.
   * **Description**: Optionally, provide a description to help identify this deployment.
   * **Resource profile**: Select the instance type for your deployment from the available options (e.g., g5.2xlarge or g5.4xlarge). The chosen profile determines the allocated GPU, CPU, and memory resources.
   * **Model**: Choose the model you wish to deploy (e.g., Meta-Llama 3.1-8B Instruct). Available model families include Llama, Mistral, and Microsoft Phi.
   * **Advanced configuration** (optional):

     The advanced configuration section allows you to fine-tune the behavior and performance of your LLM deployment based on your workload and resource requirements. Here’s a breakdown of each option:

     * **Replica Autoscaling**
       * **Min / Max Replicas**: Set the minimum and maximum number of replicas the deployment can scale between as load changes. More replicas improve availability; fewer reduce cost.
     * **Max Batch Pre-fill Tokens**
       * **Purpose**: Number of tokens prefetched for batching, improving response time.
       * **Default**: 1024. Higher values may improve performance but increase memory use.
     * **Max Batch Total Tokens**
       * **Purpose**: Total tokens queued in a batch before processing. Higher values improve throughput but may add latency.
       * **Default**: 4096.
     * **Max Tokens (per query)**
       * **Purpose**: Limits the number of tokens generated in each query response, which bounds memory use.
       * **Default**: 1024.
     * **Max Input Length (per query)**
       * **Purpose**: Maximum input tokens per query, affecting memory and processing needs.
       * **Default**: 2048.
     * **Quantization**
       * **Purpose**: Reduces model size for efficiency, with slight accuracy trade-offs.
3. Click **Create** to initiate the deployment.
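
The advanced options in step 2 interact with each other, so it can help to sanity-check a set of values before entering them in the form. The sketch below is illustrative only: `LLMDeploymentSettings` is a hypothetical helper, not part of the Dynamiq platform or SDK, the replica values are assumed because no defaults are documented, and the rule that input plus output tokens should fit within the batch total is an assumption rather than a documented platform constraint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LLMDeploymentSettings:
    """Hypothetical container mirroring the advanced options listed above."""

    min_replicas: int = 1                  # assumed; no default is documented
    max_replicas: int = 1                  # assumed; no default is documented
    max_batch_prefill_tokens: int = 1024   # documented default
    max_batch_total_tokens: int = 4096     # documented default
    max_tokens: int = 1024                 # documented default (per query)
    max_input_length: int = 2048           # documented default (per query)

    def problems(self) -> List[str]:
        """Return a list of issues; an empty list means none were found."""
        issues = []
        if self.min_replicas > self.max_replicas:
            issues.append("min_replicas must not exceed max_replicas")
        token_limits = {
            "max_batch_prefill_tokens": self.max_batch_prefill_tokens,
            "max_batch_total_tokens": self.max_batch_total_tokens,
            "max_tokens": self.max_tokens,
            "max_input_length": self.max_input_length,
        }
        for name, value in token_limits.items():
            if value <= 0:
                issues.append(f"{name} must be positive")
        # Assumed rule: one full query (input + output) should fit in a batch.
        if self.max_input_length + self.max_tokens > self.max_batch_total_tokens:
            issues.append("max_input_length + max_tokens exceeds max_batch_total_tokens")
        return issues


if __name__ == "__main__":
    # The documented defaults pass these checks: 2048 + 1024 <= 4096.
    print(LLMDeploymentSettings().problems())
```

With the documented defaults the helper reports no issues. Raising **Max Input Length** without also raising **Max Batch Total Tokens** is the kind of mismatch it is meant to flag.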

When the deployment starts, its status is **Pending** while the platform allocates resources and prepares the deployment. On success, the status changes to **Running**, which means the LLM is ready to handle requests. If an error occurs, the status changes to **Failed**.
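
The status lifecycle described above (**Pending**, then either **Running** or **Failed**) can be sketched as a simple polling loop. This is a minimal sketch, not a Dynamiq API: `wait_for_deployment` and the `get_status` callable are hypothetical names, and how you actually read the status (dashboard or API) is left to the caller.

```python
import time
from typing import Callable

# Statuses named on this page; Pending is the only non-terminal one.
TERMINAL_STATUSES = {"Running", "Failed"}


def wait_for_deployment(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll until the deployment reaches Running or Failed, then return it.

    get_status is a caller-supplied (hypothetical) function that returns
    the current status string of the deployment.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status source standing in for a real status lookup.
    fake_statuses = iter(["Pending", "Pending", "Running"])
    print(wait_for_deployment(lambda: next(fake_statuses), poll_seconds=0))
```

Treating both **Running** and **Failed** as terminal keeps the loop from waiting indefinitely on a deployment that has already failed; the timeout covers the case where the status never leaves **Pending**.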

### Using Deployed LLMs

Once your LLM deployment is in the **Running** status, it is ready to handle API requests. You can find a code example for calling the deployed model directly in the **Endpoint** section of the deployment details page on the Dynamiq platform.
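
As a rough illustration of what such a call can look like, the sketch below builds an HTTP POST request with Python's standard library. Everything specific in it is a placeholder and an assumption: the endpoint URL, the bearer-token header, and the `prompt` and `max_tokens` payload fields are hypothetical. Use the exact URL, headers, and request body shown in the **Endpoint** section instead.

```python
import json
import urllib.request


def build_request(
    endpoint_url: str,
    api_key: str,
    prompt: str,
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build (but do not send) a POST request to a deployed model.

    The payload field names and the Authorization scheme are assumptions
    for illustration; copy the real ones from the Endpoint section.
    """
    payload = {"prompt": prompt, "max_tokens": max_tokens}  # hypothetical fields
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL and key; replace both with your deployment's values.
    request = build_request("https://example.invalid/your-endpoint", "YOUR_API_KEY", "Hello")
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request) once real values are filled in.
```

Keeping the request construction separate from sending makes it easy to inspect the URL, headers, and body against the platform's own example before making a live call.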


