LLM-as-a-Judge

LLM-as-a-Judge Metrics

LLM-as-a-Judge metrics leverage large language models (LLMs) as powerful evaluation tools, allowing for automated, consistent, and customizable assessments of AI-generated outputs. By prompting an LLM to evaluate factors such as answer correctness, relevance, toxicity, grammatical accuracy, and faithfulness, you can quickly obtain reliable feedback on your Agents and RAG workflows.

Dynamiq simplifies this process—just define your evaluation criteria and let the LLM provide objective, scalable evaluations for rapid iteration and improved performance.
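
To make the idea concrete, the sketch below shows the bare mechanics of an LLM-as-a-Judge check: a judge prompt, a candidate answer, and a verdict from the model. It uses the OpenAI Python SDK purely as an illustration; the prompt wording, model choice, and judge function are assumptions, not Dynamiq's internal implementation, which you configure through the UI as described next.

```python
# Minimal LLM-as-a-Judge sketch (illustrative, not Dynamiq's implementation).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "You are an impartial evaluator. Rate the ANSWER to the QUESTION against "
    "the GROUND TRUTH for factual correctness. Respond with a score from 0 "
    "(entirely wrong) to 10 (fully correct) followed by a one-sentence reason."
)

def judge(question: str, answer: str, ground_truth: str) -> str:
    # Low temperature keeps the judge's verdicts consistent across runs.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": f"QUESTION: {question}\nANSWER: {answer}\nGROUND TRUTH: {ground_truth}",
            },
        ],
    )
    return response.choices[0].message.content

print(judge("What is the capital of France?", "The capital is Paris.", "Paris"))
```

In Dynamiq you configure the same three ingredients (instructions, model, and temperature) through the metric form described below, so you never have to write this plumbing yourself.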

Creating a Metric

To create a metric in Dynamiq focused on LLM-as-a-Judge, follow these steps:

  1. Navigate to Metrics Creation:

    • Go to Evaluations -> Metrics -> Create a Metric. This will lead you to the interface for defining new metrics.

  2. Explore Existing Templates:

    • Dynamiq provides a variety of metric templates, such as Factual Accuracy, Completeness, Clarity and Coherence, Relevance, Language Quality, Ethical Compliance, and Originality and Creativity. You can use these as-is or as a starting point for your own custom prompts.

  3. Add a New Metric:

    • Click on the Add new metric button. A form will appear where you can specify the details of your metric.

    • Name: Enter a descriptive name for your metric.

    • Instructions: Either choose from the available templates or create a custom prompt (an example prompt follows these steps).

    • LLM: Select the LLM provider and model you wish to use for metric calculation. Dynamiq supports integration with providers such as OpenAI, Anthropic, and many others.

    • Connection: Establish a connection to the selected LLM.

    • Temperature: Adjust the temperature setting to control the randomness of the model's output; lower values produce more consistent judgments.

  4. Create: Once you’ve filled in all necessary fields, click the Create button to finalize your metric.
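
As an illustration of what the Instructions field can contain, here is one way a custom factual-correctness prompt might be phrased. It is not one of the built-in templates, and the wording is only a starting point; it simply refers to the question, answer, and ground truth inputs that you map when setting up an evaluation run (see the next section).

```
Evaluate the submitted answer for factual correctness.

Compare the answer against the ground truth reference:
- Give a high score only when every factual claim in the answer is supported
  by the ground truth.
- Penalize contradictions, fabricated details, and unsupported claims.
- Ignore differences in style, tone, or length.

Return a score between 0 and 1, where 0 means entirely incorrect and 1 means
fully correct, together with a brief justification.
```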

Creating an Evaluation Run for LLM-as-a-Judge Metrics

After creating your metric, you can set up an evaluation run to assess the performance of your workflows using the LLM-as-a-Judge metrics.

Steps to Create an Evaluation Run

  1. Navigate to Evaluations:

    • In the Dynamiq portal, go to Evaluations.

  2. Create New Evaluation Run:

    • Click on the New Evaluation Run button to begin setting up your evaluation.

  3. Configure Evaluation Run:

    • Name: Enter a descriptive name for your evaluation run.

    • Dataset: Select the dataset you created earlier and ensure you choose the correct version.

    • Add Workflows:

      • Click on Add workflow to select the workflows you want to evaluate.

  4. Input Mappings:

    • Map the dataset fields to the workflow inputs (a worked mapping example follows these steps):

      • Context: Map to $.dataset.context

      • Question: Map to $.dataset.question

  5. Add Metrics:

    • Click on Add metric and select your newly created LLM-as-a-Judge metrics.

    • Map the metric inputs:

      • Question: Map to $.dataset.question

      • Answer: Map to $.workflow.answer

      • Ground Truth: Map to $.dataset.groundTruthAnswer

  6. Create Evaluation Run: Once you’ve completed all configurations, click the Create button to initiate the evaluation run.
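
The expressions used above follow a JSONPath-style convention: $.dataset.* selects a column from the selected dataset version, and $.workflow.* selects a field from the evaluated workflow's output. The sketch below shows a hypothetical dataset row and workflow output and which value each mapping picks; the field names are examples, so use the ones that actually exist in your data.

```python
# Illustrative only: a hypothetical dataset row and workflow output showing
# which value each mapping expression selects. Field names are examples.
dataset_row = {
    "context": "Paris has been the capital of France since the 10th century.",
    "question": "What is the capital of France?",
    "groundTruthAnswer": "Paris",
}

workflow_output = {
    "answer": "The capital of France is Paris.",
}

# Workflow input mappings:
#   Context      -> $.dataset.context            -> dataset_row["context"]
#   Question     -> $.dataset.question           -> dataset_row["question"]
#
# Metric input mappings:
#   Question     -> $.dataset.question           -> dataset_row["question"]
#   Answer       -> $.workflow.answer            -> workflow_output["answer"]
#   Ground Truth -> $.dataset.groundTruthAnswer  -> dataset_row["groundTruthAnswer"]
```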

Conclusion

By following these steps, you can effectively create LLM-as-a-Judge metrics and configure evaluation runs to assess the quality of your AI-generated outputs. This streamlined approach allows for efficient evaluation and continuous improvement of your AI workflows, ensuring that you deliver high-quality results.