# LLM-as-a-Judge

### Introducing "LLM-as-a-Judge" Metrics

"LLM-as-a-Judge" metrics leverage large language models (LLMs) as powerful evaluation tools, enabling automated, consistent, and customizable assessment of AI-generated outputs. By prompting an LLM to judge aspects like answer correctness, relevance, toxicity, grammatical accuracy, or faithfulness, you quickly gain reliable feedback on your Agents and RAG workflows. Dynamiq makes it simple—just define your evaluation criteria and let the LLM perform objective, scalable evaluations for rapid iteration and improved performance.

<figure><img src="/files/8sQmwxtsG5xNmRpTsQbM" alt=""><figcaption></figcaption></figure>
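The idea can be sketched in a few lines of Python. This is a minimal illustration, not Dynamiq's implementation: the prompt wording, the 1 to 5 scale, the JSON reply format, and the `fake_judge` stand-in (which replaces a real LLM call so the example runs offline) are all assumptions made for this sketch.

```python
import json
import re

# Hypothetical judge prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Rate the factual accuracy of the answer against the ground truth on a scale from 1 to 5.
Respond with JSON only, for example: {{"score": 4, "reason": "..."}}

Question: {question}
Answer: {answer}
Ground truth: {ground_truth}
"""


def build_judge_prompt(question: str, answer: str, ground_truth: str) -> str:
    """Fill the judge template with one evaluation sample."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, ground_truth=ground_truth
    )


def parse_judge_reply(reply: str) -> dict:
    """Extract the JSON object from the judge's reply and validate the score."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge reply contains no JSON object")
    verdict = json.loads(match.group(0))
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": verdict.get("reason", "")}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs without network access."""
    return 'Here is my verdict: {"score": 5, "reason": "Matches the ground truth."}'


prompt = build_judge_prompt("What is the capital of France?", "Paris", "Paris")
verdict = parse_judge_reply(fake_judge(prompt))
print(verdict)
```

Asking the judge for a structured reply and validating it before use keeps malformed or out-of-range verdicts from silently entering your scores.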

### Creating a Metric

To create a metric, navigate to the **Evaluations** section in your Dynamiq portal. Follow these steps:

1. **Go to Evaluations -> Metrics -> Create a Metric**: This path will lead you to the interface where you can define new metrics.
2. **Explore Existing Templates**: Dynamiq offers a variety of metric templates such as Factual Accuracy, Completeness, Clarity and Coherence, Relevance, Language Quality, Ethical Compliance, and Originality and Creativity. These templates can serve as inspiration for crafting your prompts.
3. **Add a New Metric**: Click on the **Add new metric** button. You will see a form where you can specify the details of your metric.
   * **Name**: Enter a descriptive name for your metric.
   * **Instructions**: Choose from the available templates or create a custom prompt.
   * **LLM**: Select the LLM provider and model that will be used for metric calculation. Dynamiq supports seamless integration with providers like OpenAI.
   * **Connection**: Establish a connection to the selected model.
   * **Temperature**: Adjust the temperature setting to control the randomness of the model's output.
4. **Create**: Once all fields are filled, click the **Create** button to finalize your metric.
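To build intuition for the **Temperature** field, the sketch below applies the standard temperature-scaled softmax to a few made-up token scores. It illustrates the general sampling mechanism only; how a given provider applies temperature internally (for example, how it treats a value of 0) is not covered here, and the `toy_logits` values are invented.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass lands on the top-scoring token, which is why low settings are a common choice for judges that should score the same answer the same way on repeated runs.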

<figure><img src="/files/Owp7Dz33XL3FJh8RaFa3" alt=""><figcaption></figcaption></figure>

### Creating an Evaluation Dataset

A well-prepared dataset is crucial for assessing the performance of AI workflows and ensuring that evaluation metrics capture the nuances of different answers. Here's how to create and manage your evaluation dataset in Dynamiq.

#### Steps to Create an Evaluation Dataset

1. **Navigate to Datasets**: In the Dynamiq portal, go to the **Evaluations** section and select **Datasets**. This is where you can manage your datasets.
2. **Add New Dataset**: Click on the **Add new dataset** button to start creating a new dataset.
   * **Name**: Enter a descriptive name for your dataset.
   * **Description**: Provide a brief description of the dataset's purpose and contents.
   * **Upload from File**: You can upload your dataset in JSON format. Click on the upload area or drag and drop your JSON file. If you need a reference, download the **Sample JSON** to see the required format.
3. **JSON Structure**: Your JSON file should include essential data such as input prompts and desired outputs. Here's an example structure:

   ```json
   [
     {
       "question": "What is the capital of France?",
       "context": "France, a country in Western Europe, has its capital in Paris, which is renowned for its art, culture, and the iconic Eiffel Tower.",
       "ground_truth_answer": "Paris"
     },
     {
       "question": "Who developed the theory of relativity?",
       "context": "The theory of relativity, which revolutionized our understanding of space, time, and gravity, was developed by the physicist Albert Einstein in the early 20th century.",
       "ground_truth_answer": "Albert Einstein"
     }
   ]
   ```
4. **Create**: Once your file is uploaded, click the **Create** button to finalize your dataset.
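Before uploading, you may want to check the file locally. The following is a minimal sketch that assumes every entry needs the three keys shown in the example above (`question`, `context`, `ground_truth_answer`) as non-empty strings; the actual requirements are defined by the **Sample JSON** in the portal, and `validate_dataset` is a hypothetical helper, not part of Dynamiq.

```python
import json

# Assumed from the example structure above; adjust to match the Sample JSON.
REQUIRED_KEYS = {"question", "context", "ground_truth_answer"}


def validate_dataset(raw_json: str) -> list[dict]:
    """Parse dataset text and check that every entry has the expected string fields."""
    entries = json.loads(raw_json)  # also rejects comments and trailing commas
    if not isinstance(entries, list) or not entries:
        raise ValueError("dataset must be a non-empty JSON array")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not isinstance(entry[key], str) or not entry[key].strip():
                raise ValueError(f"entry {index} has an empty or non-string '{key}'")
    return entries


sample = """
[
  {
    "question": "What is the capital of France?",
    "context": "France, a country in Western Europe, has its capital in Paris.",
    "ground_truth_answer": "Paris"
  }
]
"""
entries = validate_dataset(sample)
print(f"{len(entries)} valid entries")
```

Because `json.loads` accepts only strict JSON, this check also catches comments and trailing commas before the file reaches the upload form.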

<figure><img src="/files/HdDuY0J6cDPIZDIZxmdX" alt=""><figcaption></figcaption></figure>

After you review the dataset, release it to make it available for evaluation.

<figure><img src="/files/rEGNpHakHTmg0YWl8jVA" alt=""><figcaption></figcaption></figure>

#### Reviewing Your Dataset

After uploading, you can review your dataset entries:

* **Dataset Overview**: View the dataset's version, creator, and last edited details.
* **Dataset Entries**: Examine each entry's context, question, and ground truth answer to ensure accuracy and completeness.
* **Upload New Version**: If updates are needed, you can upload a new version of your dataset.

By following these steps, you can create a comprehensive dataset that will enhance the evaluation process, ensuring your AI workflows are thoroughly tested and validated.

### Creating Example Workflows to Showcase Metrics Power

To demonstrate the effectiveness of LLM-as-a-judge metrics, we'll create two workflows: one that generates accurate answers and another that produces answers with mistakes. This will highlight how metrics can differentiate between high-quality and low-quality outputs.

<figure><img src="/files/3zUuxfEhuIjGGBQ2wP0D" alt=""><figcaption></figcaption></figure>

#### Prompt 1: Accurate Answers

```markdown
You are an expert assistant providing precise and accurate answers to questions. 
Ensure that your answers are correct, concise, and, where appropriate, include brief explanations to enhance understanding.

Instructions:
- Provide accurate information in response to the question.
- Keep the answer clear and concise.
- Include a brief explanation if it adds value.
- Do not include irrelevant information.
- Use proper grammar and spelling.
- Maintain a professional tone.

Question: {{question}}
Context: {{context}}

Answer:
```

#### Prompt 2: Inaccurate Answers

```markdown
You are an assistant providing answers to questions, but you often make mistakes and include irrelevant information.

Instructions:
- Provide an answer to the question and include:
  - Major and minor factual errors.
  - Incomplete or insufficient information.
  - Irrelevant or off-topic details.
  - Grammatical and spelling mistakes.
- Aim for a casual tone.

Question: {{question}}
Context: {{context}}

Answer:
```
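To preview how the `{{question}}` and `{{context}}` placeholders are filled from a dataset entry, you can render a template locally. This sketch uses a plain regular-expression substitution as a stand-in; it is not the platform's template engine, and `render_prompt` is a hypothetical helper.

```python
import re

PROMPT_TEMPLATE = """Question: {{question}}
Context: {{context}}

Answer:"""


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with its value; fail loudly on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value provided for placeholder '{name}'")
        return variables[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


entry = {
    "question": "What is the capital of France?",
    "context": "France, a country in Western Europe, has its capital in Paris.",
}
rendered = render_prompt(PROMPT_TEMPLATE, entry)
print(rendered)
```

Raising on a missing placeholder, rather than leaving `{{name}}` in the text, makes a mismatch between dataset fields and prompt variables visible immediately.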

<figure><img src="/files/cctaTAEdZiNYcFg4MZup" alt=""><figcaption></figcaption></figure>

#### Creating and Deploying the Workflows

1. **Navigate to Workflows**: In the Dynamiq portal, go to the **Workflows** section.
2. **Create New Workflow**: Click on the **Create** button to start a new workflow.
3. **Configure Workflow**:
   * **Name**: Give each workflow a descriptive name (e.g., "accurate-workflow" and "inaccurate-workflow").
   * **Prompt**: Use the templates provided above for each workflow.
   * **LLM Selection**: Choose the appropriate LLM provider and model for generating responses.
4. **Deploy Workflows**: Once configured, deploy the workflows to start generating answers based on the provided prompts.

<figure><img src="/files/RLaOhqJUlpVaggqLFMYI" alt=""><figcaption></figcaption></figure>

By setting up these workflows, you can clearly see how LLM-as-a-judge metrics can distinguish between accurate and inaccurate responses, showcasing their power in evaluating AI-generated content.

### Creating an Evaluation Run for Workflows

Now that we have our workflows and metrics set up, it's time to create an evaluation run. This will allow us to assess the performance of our workflows using the metrics we previously defined.

<figure><img src="/files/6flEUJ5S7g4GnvHZeiXp" alt=""><figcaption></figcaption></figure>

#### Steps to Create an Evaluation Run

1. **Navigate to Evaluations**: In the Dynamiq portal, go to the **Evaluations** section.
2. **Create New Evaluation Run**: Click on the **New Evaluation Run** button to start setting up your evaluation.
3. **Configure Evaluation Run**:
   * **Name**: Enter a descriptive name for your evaluation run.
   * **Dataset**: Select the dataset you prepared earlier. Ensure you choose the correct version.
4. **Add Workflows**:
   * Click on **Add workflow**.
   * Select the workflows you want to evaluate (e.g., "accurate-workflow" and "inaccurate-workflow").
   * Choose the appropriate workflow version.
5. **Input Mappings**:
   * Map the dataset fields to the workflow inputs. For example:
     * **Context**: Map to `$.dataset.context`
     * **Question**: Map to `$.dataset.question`
6. **Add Metrics**:
   * Click on **Add metric**.
   * Select the metrics you want to use for evaluation (e.g., Factual Accuracy, Completeness).
   * Map the metric inputs to the appropriate fields:
     * **Question**: Map to `$.dataset.question`
     * **Answer**: Map to `$.workflow.answer`
     * **Ground Truth**: Map to `$.dataset.ground_truth_answer`
7. **Create Evaluation Run**: Once all configurations are set, click the **Create** button to initiate the evaluation run.
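The mapping expressions above can be read as dotted paths into the data available for each dataset entry. The sketch below shows that reading with a tiny resolver for `$.a.b` paths only; it is an illustration under the assumption that path segments match the keys of the dataset entry and the workflow output, not Dynamiq's actual resolver. The `scope` structure and the workflow answer are placeholders.

```python
def resolve_path(path: str, scope: dict) -> object:
    """Resolve a simple dotted path such as '$.dataset.question' against nested dicts."""
    if not path.startswith("$."):
        raise ValueError(f"path must start with '$.': {path}")
    value = scope
    for segment in path[2:].split("."):
        if not isinstance(value, dict) or segment not in value:
            raise KeyError(f"segment '{segment}' not found while resolving {path}")
        value = value[segment]
    return value


# One dataset entry plus a placeholder workflow output (invented for illustration).
scope = {
    "dataset": {
        "question": "What is the capital of France?",
        "context": "France, a country in Western Europe, has its capital in Paris.",
        "ground_truth_answer": "Paris",
    },
    "workflow": {"answer": "The capital of France is Paris."},
}

metric_mappings = {
    "question": "$.dataset.question",
    "answer": "$.workflow.answer",
    "ground_truth": "$.dataset.ground_truth_answer",
}
metric_inputs = {name: resolve_path(path, scope) for name, path in metric_mappings.items()}
print(metric_inputs)
```

A path whose last segment does not match a key in the entry fails to resolve, so it is worth checking each mapping against the field names in your uploaded JSON.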

### Running and Reviewing an Evaluation

Once you've set up your evaluation run, you can quickly assess the performance of your workflows using the metrics. Here's how to execute and review an evaluation run.

#### Steps to Execute an Evaluation Run

1. **Initiate Evaluation Run**: After configuring your evaluation settings, click **Create** to start the evaluation job. The system will begin processing the workflows with the selected metrics.
2. **Monitor Evaluation Status**: In the **Evaluations** section, you can see the status of your evaluation runs. It will initially show as "Running" and change to "Succeeded" once completed.
3. **Review Results**: Once the evaluation is complete, you can review the answers and their corresponding metrics.

<figure><img src="/files/57ua9YCc2wfdDq7lMdul" alt=""><figcaption></figcaption></figure>
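If you track run status from a script rather than the portal, the waiting logic might look like the sketch below. This page does not describe a status API, so `get_status` is a hypothetical callable that you would have to supply; only the "Running" and "Succeeded" status strings come from the steps above.

```python
import time
from typing import Callable


def wait_for_run(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll until the run leaves the "Running" state, then return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status != "Running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation run did not finish before the timeout")
        time.sleep(poll_seconds)


# Simulated status source standing in for a real lookup (hypothetical).
statuses = iter(["Running", "Running", "Succeeded"])
final_status = wait_for_run(lambda: next(statuses), poll_seconds=0)
print(final_status)
```

Returning any non-"Running" status, instead of waiting specifically for "Succeeded", keeps the loop from hanging if a run ends in some other state.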

#### Reviewing Evaluation Results

* **Evaluation Runs Overview**: The main screen will list all evaluation runs, showing their names, statuses, and creators. Successful runs will be marked as "Succeeded."
* **Detailed Results**: Click on an evaluation run to see detailed results. You'll find:
  * **Context and Question**: The input data used for generating answers.
  * **Ground Truth Answer**: The correct answer for comparison.
  * **Workflow Outputs**: Answers generated by each workflow version.
  * **Metrics Scores**: Scores for each metric, such as Clarity and Coherence, Ethical Compliance, Language Quality, and Factual Accuracy.
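When comparing workflows, it often helps to average each metric per workflow. The sketch below assumes you have copied the scores into a list of rows by hand; the row layout, the numeric scale, and every score value are invented placeholders, not real evaluation output or a Dynamiq export format.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows, invented for illustration only; not real evaluation output.
rows = [
    {"workflow": "accurate-workflow", "metric": "Factual Accuracy", "score": 5},
    {"workflow": "accurate-workflow", "metric": "Factual Accuracy", "score": 4},
    {"workflow": "inaccurate-workflow", "metric": "Factual Accuracy", "score": 1},
    {"workflow": "inaccurate-workflow", "metric": "Factual Accuracy", "score": 2},
]


def average_scores(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Group scores by (workflow, metric) and return the mean of each group."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for row in rows:
        grouped[(row["workflow"], row["metric"])].append(row["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


averages = average_scores(rows)
for (workflow, metric), value in sorted(averages.items()):
    print(f"{workflow:22} {metric:18} {value:.2f}")
```

Averages hide per-entry variation, so it is still worth opening the individual entries where the two workflows differ most.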

<figure><img src="/files/t0QdVdPzv0OGFXAIOQ42" alt=""><figcaption></figcaption></figure>

### Conclusion

By running and reviewing evaluation runs, you can effectively measure the quality of your workflows. This process provides valuable insights into how well your workflows perform and where improvements can be made, ensuring high-quality outputs from your AI systems.

