Metrics creation
Section 1: Creating Evaluation Metrics
In this section, we'll delve into the creation of evaluation metrics that will serve as the backbone of our workflow assessment. By designing clear and effective metrics, we enable Large Language Models (LLMs) to evaluate answers consistently and objectively. Let's explore how to craft these metrics and the corresponding prompts.
Understanding Evaluation Metrics
Evaluation metrics are criteria or standards used to measure the quality of answers generated by workflows. By defining specific metrics, we can assess various aspects of an answer, such as its accuracy, completeness, clarity, and more.
Key Benefits of Well-Defined Metrics:
Objectivity: Provides a standardized way to assess answers, reducing subjectivity.
Consistency: Ensures evaluations are uniform across different answers and evaluators.
Comprehensiveness: Allows for a multi-faceted assessment covering all important aspects of an answer.
How to create a metric in Dynamiq

Defining the Metrics
We'll focus on the following seven essential metrics:
Factual Accuracy
Completeness
Clarity and Coherence
Relevance
Language Quality
Ethical Compliance
Originality and Creativity
Each metric targets a specific dimension of answer quality, contributing to a holistic evaluation.
1. Factual Accuracy
Purpose: Assess whether the answer contains correct and verifiable information.
Why It Matters:
Trustworthiness: Users rely on accurate information.
Credibility: Enhances the reputation of the AI system.
Avoiding Misinformation: Prevents the spread of false or misleading details.
Prompt:
2. Completeness
Purpose: Determine if the answer fully addresses all aspects of the question.
Why It Matters:
User Satisfaction: Ensures users receive all the information they need.
Functionality: Important for tasks requiring comprehensive responses.
Prompt:
3. Clarity and Coherence
Purpose: Evaluate the clarity of expression and logical flow of the answer.
Why It Matters:
Understanding: Clear answers are easier for users to comprehend.
Professionalism: Reflects the quality and reliability of the AI system.
Prompt:
4. Relevance
Purpose: Check if the answer is directly relevant to the question asked.
Why It Matters:
Efficiency: Users value answers that address their queries directly.
Contextual Accuracy: Ensures the information provided is appropriate.
Prompt:
5. Language Quality
Purpose: Analyze grammar, spelling, and stylistic aspects of the answer.
Why It Matters:
Readability: Good language quality enhances comprehension.
Professional Impression: Reflects well on the AI system's sophistication.
Prompt:
6. Ethical Compliance
Purpose: Ensure the answer adheres to ethical standards, avoiding biased or inappropriate content.
Why It Matters:
User Safety: Prevents exposure to harmful or offensive material.
Regulatory Compliance: Adheres to legal and ethical guidelines.
Prompt:
7. Originality and Creativity
Purpose: Assess whether the answer brings unique insights or innovative perspectives.
Why It Matters:
Engagement: Creative answers can be more engaging for users.
Value Addition: Provides users with novel information or perspectives.
Prompt:
Implementing the Metrics
To use these evaluation metrics effectively:
Customize Prompts: Ensure that each prompt is tailored to the specific metric and uses consistent formatting.
Use Placeholder Variables: Replace
{{question}}and{{answer}}with the actual question and answer you're evaluating.Consistent Instructions: Emphasize that the evaluator (LLM) should output only the JSON object with the score, avoiding additional commentary.
Scoring Scale: Use clear definitions for each score value to guide the LLM in assigning appropriate scores.
Why Use JSON Output?
Machine-Readable: Easy to parse programmatically, enabling automated processing.
Standardized Format: Facilitates integration with various tools and databases.
Simplicity: Keeping the output minimal reduces the likelihood of errors in interpretation.
Examples of Metrics in Action
Let's see how these metrics work with an example.
Question:
"What is the capital of Australia?"
Answer Provided:
"Sydney."
Evaluating Factual Accuracy
Prompt:
Use the prompt from the Factual Accuracy metric, inserting the question and answer.
LLM Output:
Explanation:
The correct capital of Australia is Canberra, not Sydney.
According to our scoring scale, several factual errors result in a score of 2.
Tips for Creating Effective Evaluation Metrics
Clarity is Key: Make sure the prompts are clear and unambiguous.
Specific Scoring Criteria: Define what each score represents to guide consistent evaluations.
Avoid Overlapping Metrics: Ensure each metric assesses a distinct aspect to prevent redundancy.
Test Your Prompts: Run sample evaluations to see if the prompts yield the desired outputs.
Integrating Metrics with Dynamiq
Using these metrics within Dynamiq involves:
Setting Up Evaluations: Configure Dynamiq to use the prompts for each metric when evaluating workflow answers.
Automating Processes: Leverage Dynamiq's capabilities to automate the evaluation across your dataset.
Analyzing Results: Collect and analyze the scores to identify patterns, strengths, and areas for improvement.
Conclusion
By carefully crafting evaluation metrics and corresponding prompts, we're laying a solid foundation for assessing and improving our AI workflows. These metrics will enable us to:
Quantify Answer Quality: Turning qualitative aspects into measurable data.
Enhance Workflow Performance: Providing actionable insights to refine our systems.
Build User Trust: Ensuring that the outputs meet high standards of quality and reliability.
Next Steps:
Proceed to Section 2: Preparing the Evaluation Dataset, where we'll create a diverse set of questions and answers to test our evaluation framework.
Keep these metrics and prompts handy—they'll be essential tools as we move forward.
Let's continue our journey toward building reliable and trustworthy AI workflows! 🚀
Last updated