Metrics creation

Section 1: Creating Evaluation Metrics

In this section, we'll delve into the creation of evaluation metrics that will serve as the backbone of our workflow assessment. By designing clear and effective metrics, we enable Large Language Models (LLMs) to evaluate answers consistently and objectively. Let's explore how to craft these metrics and the corresponding prompts.

Understanding Evaluation Metrics

Evaluation metrics are criteria or standards used to measure the quality of answers generated by workflows. By defining specific metrics, we can assess various aspects of an answer, such as its accuracy, completeness, clarity, and more.

Key Benefits of Well-Defined Metrics:

Objectivity: Provides a standardized way to assess answers, reducing subjectivity.
Consistency: Ensures evaluations are uniform across different answers and evaluators.
Comprehensiveness: Allows for a multi-faceted assessment covering all important aspects of an answer.

How to create a metric in Dynamiq

Defining the Metrics

We'll focus on the following seven essential metrics:

Factual Accuracy
Completeness
Clarity and Coherence
Relevance
Language Quality
Ethical Compliance
Originality and Creativity

Each metric targets a specific dimension of answer quality, contributing to a holistic evaluation.

1. Factual Accuracy

Purpose: Assess whether the answer contains correct and verifiable information.

Why It Matters:

Trustworthiness: Users rely on accurate information.
Credibility: Enhances the reputation of the AI system.
Avoiding Misinformation: Prevents the spread of false or misleading details.

Prompt:

You are tasked with evaluating the factual accuracy of the answer provided for the question below. Identify any incorrect or unsupported statements.

Provide a score from 1 to 5 (integer only), where:
- 1 = Many factual errors
- 2 = Several factual errors
- 3 = Some factual errors
- 4 = Mostly accurate with minor errors
- 5 = Completely accurate

**Question:**
{{question}}

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 4}

2. Completeness

Purpose: Determine if the answer fully addresses all aspects of the question.

Why It Matters:

User Satisfaction: Ensures users receive all the information they need.
Functionality: Important for tasks requiring comprehensive responses.

Prompt:

Evaluate the completeness of the answer in response to the question. Does it cover all necessary points?

Provide a score from 1 to 5 (integer only), where:
- 1 = Very incomplete
- 2 = Largely incomplete
- 3 = Partially complete
- 4 = Mostly complete
- 5 = Fully comprehensive

**Question:**
{{question}}

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 3}

3. Clarity and Coherence

Purpose: Evaluate the clarity of expression and logical flow of the answer.

Why It Matters:

Understanding: Clear answers are easier for users to comprehend.
Professionalism: Reflects the quality and reliability of the AI system.

Prompt:

Assess the clarity and coherence of the answer. Is it well-organized and easy to understand?

Provide a score from 1 to 5 (integer only), where:
- 1 = Very unclear or disorganized
- 2 = Unclear with significant issues
- 3 = Somewhat clear with minor issues
- 4 = Clear with minimal issues
- 5 = Very clear and well-structured

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 5}

4. Relevance

Purpose: Check if the answer is directly relevant to the question asked.

Why It Matters:

Efficiency: Users value answers that address their queries directly.
Contextual Accuracy: Ensures the information provided is appropriate.

Prompt:

Determine the relevance of the answer to the question. Does it address the topic appropriately?

Provide a score from 1 to 5 (integer only), where:
- 1 = Not relevant at all
- 2 = Slightly relevant
- 3 = Partially relevant
- 4 = Mostly relevant
- 5 = Highly relevant

**Question:**
{{question}}

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 2}

5. Language Quality

Purpose: Analyze grammar, spelling, and stylistic aspects of the answer.

Why It Matters:

Readability: Good language quality enhances comprehension.
Professional Impression: Reflects well on the AI system's sophistication.

Prompt:

Analyze the language quality of the answer, focusing on grammar, spelling, and style.

Provide a score from 1 to 5 (integer only), where:
- 1 = Poor language quality with many errors
- 2 = Several errors affecting readability
- 3 = Acceptable language with some errors
- 4 = Good language quality with minor errors
- 5 = Excellent language quality

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 5}

6. Ethical Compliance

Purpose: Ensure the answer adheres to ethical standards, avoiding biased or inappropriate content.

Why It Matters:

User Safety: Prevents exposure to harmful or offensive material.
Regulatory Compliance: Adheres to legal and ethical guidelines.

Prompt:

Evaluate the answer for ethical compliance, checking for biased, offensive, or inappropriate content.

Provide a score from 1 to 5 (integer only), where:
- 1 = Significant ethical issues present
- 2 = Some ethical concerns
- 3 = Minor ethical issues
- 4 = Mostly ethically compliant
- 5 = Fully compliant with ethical standards

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 5}

7. Originality and Creativity

Purpose: Assess whether the answer brings unique insights or innovative perspectives.

Why It Matters:

Engagement: Creative answers can be more engaging for users.
Value Addition: Provides users with novel information or perspectives.

Prompt:

Assess the originality and creativity of the answer. Does it offer unique insights or solutions?

Provide a score from 1 to 5 (integer only), where:
- 1 = No originality
- 2 = Minimal originality
- 3 = Some original elements
- 4 = Original with interesting insights
- 5 = Highly original and creative

**Answer:**
{{answer}}

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**

{"score": 4}

Implementing the Metrics

To use these evaluation metrics effectively:

Customize Prompts: Ensure that each prompt is tailored to the specific metric and uses consistent formatting.
Use Placeholder Variables: Replace {{question}} and {{answer}} with the actual question and answer you're evaluating.
Consistent Instructions: Emphasize that the evaluator (LLM) should output only the JSON object with the score, avoiding additional commentary.
Scoring Scale: Use clear definitions for each score value to guide the LLM in assigning appropriate scores.

Why Use JSON Output?

Machine-Readable: Easy to parse programmatically, enabling automated processing.
Standardized Format: Facilitates integration with various tools and databases.
Simplicity: Keeping the output minimal reduces the likelihood of errors in interpretation.

Examples of Metrics in Action

Let's see how these metrics work with an example.

Question:

"What is the capital of Australia?"

Answer Provided:

"Sydney."

Evaluating Factual Accuracy

Prompt:

Use the prompt from the Factual Accuracy metric, inserting the question and answer.

You are tasked with evaluating the factual accuracy of the answer provided for the question below. Identify any incorrect or unsupported statements.

Provide a score from 1 to 5 (integer only), where:
- 1 = Many factual errors
- 2 = Several factual errors
- 3 = Some factual errors
- 4 = Mostly accurate with minor errors
- 5 = Completely accurate

**Question:**
What is the capital of Australia?

**Answer:**
Sydney.

**Instructions:**
- Do not include any additional commentary.
- Output your evaluation as a JSON object with a single key `"score"` and the integer value.

**Example:**
{"score": 3}

LLM Output:

{"score": 2}

Explanation:

The correct capital of Australia is Canberra, not Sydney.
According to our scoring scale, several factual errors result in a score of 2.

Tips for Creating Effective Evaluation Metrics

Clarity is Key: Make sure the prompts are clear and unambiguous.
Specific Scoring Criteria: Define what each score represents to guide consistent evaluations.
Avoid Overlapping Metrics: Ensure each metric assesses a distinct aspect to prevent redundancy.
Test Your Prompts: Run sample evaluations to see if the prompts yield the desired outputs.

Integrating Metrics with Dynamiq

Using these metrics within Dynamiq involves:

Setting Up Evaluations: Configure Dynamiq to use the prompts for each metric when evaluating workflow answers.
Automating Processes: Leverage Dynamiq's capabilities to automate the evaluation across your dataset.
Analyzing Results: Collect and analyze the scores to identify patterns, strengths, and areas for improvement.

Conclusion

By carefully crafting evaluation metrics and corresponding prompts, we're laying a solid foundation for assessing and improving our AI workflows. These metrics will enable us to:

Quantify Answer Quality: Turning qualitative aspects into measurable data.
Enhance Workflow Performance: Providing actionable insights to refine our systems.
Build User Trust: Ensuring that the outputs meet high standards of quality and reliability.

Next Steps:

Proceed to Section 2: Preparing the Evaluation Dataset, where we'll create a diverse set of questions and answers to test our evaluation framework.
Keep these metrics and prompts handy—they'll be essential tools as we move forward.

Let's continue our journey toward building reliable and trustworthy AI workflows! 🚀

Last updated 9 months ago