LLM-as-a-Judge
LLM-as-a-Judge metrics leverage large language models (LLMs) as powerful evaluation tools, allowing for automated, consistent, and customizable assessments of AI-generated outputs. By prompting an LLM to evaluate factors such as answer correctness, relevance, toxicity, grammatical accuracy, and faithfulness, you can quickly obtain reliable feedback on your Agents and RAG workflows.
Dynamiq simplifies this process—just define your evaluation criteria and let the LLM provide objective, scalable evaluations for rapid iteration and improved performance.
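The sketch below illustrates the core idea: a judge prompt that asks an LLM to score a candidate answer against a reference. It is a minimal, standalone example using the OpenAI Python client, not the code Dynamiq runs internally; the model name, rubric, and output format are placeholders you would adapt to your own criteria.

```python
# Conceptual LLM-as-a-Judge sketch: grade an answer against a reference.
# Illustrative only; model name and rubric are placeholder assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}
Reference answer: {ground_truth}

Score the candidate answer for factual correctness against the reference
on a scale of 1-5, then briefly justify the score.
Respond as: score: <1-5>, reason: <one sentence>."""

def judge(question: str, answer: str, ground_truth: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any capable judge model works here
        temperature=0,         # low temperature keeps scoring consistent
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer=answer, ground_truth=ground_truth
            ),
        }],
    )
    return response.choices[0].message.content

print(judge("What is the capital of France?", "Paris.", "Paris"))
```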
To create a metric in Dynamiq focused on LLM-as-a-Judge, follow these steps:
Navigate to Metrics Creation:
Go to Evaluations -> Metrics -> Create a Metric. This will lead you to the interface for defining new metrics.
Explore Existing Templates:
Dynamiq provides a variety of metric templates, such as Factual Accuracy, Completeness, Clarity and Coherence, Relevance, Language Quality, Ethical Compliance, and Originality and Creativity. These can serve as starting points for your own custom prompts.
Add a New Metric:
Click on the Add new metric button. A form will appear where you can specify the details of your metric.
Name: Enter a descriptive name for your metric.
Instructions: Either choose from the available templates or write a custom prompt (an example is sketched after these steps).
LLM: Select the LLM provider and model you wish to use for metric calculation. Dynamiq supports integration with various providers, such as OpenAI, Anthropic, and many more.
Connection: Establish a connection to the selected LLM model.
Temperature: Adjust the temperature setting to control the randomness of the model's output; lower values produce more consistent judgments.
Create: Once you’ve filled in all necessary fields, click the Create button to finalize your metric.
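For reference, the sketch below shows the kind of values these fields might hold for a custom factual-accuracy metric. The field names, placeholder syntax ({question}, {answer}, {ground_truth}), and model identifier are illustrative assumptions, not the exact Dynamiq schema; use the form in the portal as the source of truth.

```python
# Illustrative values for the metric form; field names, placeholder syntax,
# and model identifier are examples, not the exact Dynamiq schema.
metric_definition = {
    "name": "Factual Accuracy (vs. ground truth)",
    "instructions": (
        "Compare the answer to the ground truth answer for the given question. "
        "Penalize statements that contradict the ground truth or add unsupported facts. "
        "Return a score from 1 (inaccurate) to 5 (fully accurate) with a short justification.\n"
        "Question: {question}\nAnswer: {answer}\nGround truth: {ground_truth}"
    ),
    "llm": {"provider": "OpenAI", "model": "gpt-4o-mini"},
    "temperature": 0.0,  # near-zero temperature favors stable, repeatable judgments
}
```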
After creating your metric, you can set up an evaluation run to assess the performance of your workflows using the LLM-as-a-Judge metrics.
Navigate to Evaluations:
In the Dynamiq portal, go to Evaluations.
Create New Evaluation Run:
Click on the New Evaluation Run button to begin setting up your evaluation.
Configure Evaluation Run:
Name: Enter a descriptive name for your evaluation run.
Dataset: Select the dataset you created earlier and ensure you choose the correct version.
Add Workflows:
Click on Add workflow to select the workflows you want to evaluate.
Input Mappings:
Map the dataset fields to the workflow inputs (a concrete mapping sketch follows these steps):
Context: Map to $.dataset.context
Question: Map to $.dataset.question
Add Metrics:
Click on Add metric and select your newly created LLM-as-a-Judge metrics.
Map the metric inputs:
Question: Map to $.dataset.question
Answer: Map to $.workflow.answer
Ground Truth: Map to $.dataset.groundTruthAnswer
Create Evaluation Run: Once you’ve completed all configurations, click the Create button to initiate the evaluation run.
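To make the mapping expressions concrete, here is a sketch of how they resolve for a single record. The record contents are invented for illustration; the keys mirror the mappings above, and the pattern is that $.dataset.* reads from the selected dataset while $.workflow.* reads from the evaluated workflow's output.

```python
# How the $.dataset.* and $.workflow.* expressions resolve for one record.
# The values are made up; only the mapping pattern matters.
dataset_row = {
    "context": "Dynamiq supports LLM-as-a-Judge metrics for evaluations.",
    "question": "What does Dynamiq use LLM-as-a-Judge metrics for?",
    "groundTruthAnswer": "For automated evaluation of AI-generated outputs.",
}
workflow_output = {
    "answer": "Dynamiq uses them to automatically score workflow outputs.",
}

# Workflow inputs
workflow_inputs = {
    "context": dataset_row["context"],    # $.dataset.context
    "question": dataset_row["question"],  # $.dataset.question
}

# Metric inputs
metric_inputs = {
    "question": dataset_row["question"],               # $.dataset.question
    "answer": workflow_output["answer"],               # $.workflow.answer
    "ground_truth": dataset_row["groundTruthAnswer"],  # $.dataset.groundTruthAnswer
}
```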
By following these steps, you can effectively create LLM-as-a-Judge metrics and configure evaluation runs to assess the quality of your AI-generated outputs. This streamlined approach allows for efficient evaluation and continuous improvement of your AI workflows, ensuring that you deliver high-quality results.