Answer Correctness

Answer Correctness Evaluator

The Answer Correctness Evaluator assesses the correctness of answers by measuring how well they align with ground truth answers. It extracts key statements from both the answer and the ground truth, classifies them as True Positives (TP), False Positives (FP), or False Negatives (FN), and computes a similarity score between each answer and its ground truth. These signals are then combined into a final correctness score.
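
For intuition, the sketch below shows what the classification step might produce for the first question used in the example code further down. The statements and labels are hypothetical and purely illustrative; the evaluator performs the extraction and classification internally via the LLM.

# Hypothetical statement classification for the "sun" question.
# The answer claims fission and mentions light; the ground truth states fusion and heat + light.
classified_statements = {
    "TP": ["The sun provides light."],                  # in the answer and supported by the ground truth
    "FP": ["The sun is powered by nuclear fission."],   # in the answer but contradicted by the ground truth
    "FN": [                                             # in the ground truth but missing from the answer
        "The sun is powered by nuclear fusion.",
        "The sun provides heat, which is essential for life on Earth.",
    ],
}

tp = len(classified_statements["TP"])  # 1
fp = len(classified_statements["FP"])  # 1
fn = len(classified_statements["FN"])  # 2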

Key Formulas

  1. Precision:

$$\text{Precision} = \frac{TP}{TP + FP}$$

  2. Recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

  3. F1 Score:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

  4. Final Score (combined from the F1 score and the similarity score):

$$\text{Final Score} = w_1 \times F1 + w_2 \times \text{Similarity Score}$$

where $w_1$ and $w_2$ are the weights assigned to the F1 and similarity scores, respectively.
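
To make the formulas concrete, the short sketch below works through the calculation with hypothetical counts, an assumed similarity score, and example weights; none of these values are the evaluator's defaults.

# Worked example with hypothetical counts, e.g. TP=1, FP=1, FN=2
tp, fp, fn = 1, 1, 2

precision = tp / (tp + fp)                           # 1 / 2 = 0.5
recall = tp / (tp + fn)                              # 1 / 3 ≈ 0.333
f1 = 2 * precision * recall / (precision + recall)   # 0.4

# Hypothetical similarity score and weights (w1, w2); illustrative values only
similarity_score = 0.8
w1, w2 = 0.75, 0.25

final_score = w1 * f1 + w2 * similarity_score        # 0.75 * 0.4 + 0.25 * 0.8 = 0.5
print(round(final_score, 3))  # 0.5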

Example Code: Answer Correctness Evaluation

This example demonstrates how to compute the Answer Correctness metric using the AnswerCorrectnessEvaluator with an OpenAI language model.

import logging
import sys
from dotenv import find_dotenv, load_dotenv
from dynamiq.evaluations.metrics import AnswerCorrectnessEvaluator
from dynamiq.nodes.llms import OpenAI

# Load environment variables for the OpenAI API
load_dotenv(find_dotenv())

# Configure logging level
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Initialize the OpenAI language model
llm = OpenAI(model="gpt-4o-mini")

# Sample data
questions = [
    "What powers the sun and what is its primary function?",
    "What is the boiling point of water?",
]
answers = [
    (
        "The sun is powered by nuclear fission, similar to nuclear reactors on Earth."
        " Its primary function is to provide light to the solar system."
    ),
    "The boiling point of water is 100 degrees Celsius at sea level.",
]
ground_truth_answers = [
    (
        "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium."
        " This fusion process releases a tremendous amount of energy. The sun provides"
        " heat and light, which are essential for life on Earth."
    ),
    (
        "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at"
        " sea level. The boiling point can change with altitude."
    ),
]

# Initialize evaluator
evaluator = AnswerCorrectnessEvaluator(llm=llm)

# Evaluate
correctness_scores = evaluator.run(
    questions=questions,
    answers=answers,
    ground_truth_answers=ground_truth_answers,
    verbose=False,  # Set verbose=True to enable logging
)

# Print the results
for idx, score in enumerate(correctness_scores):
    print(f"Question: {questions[idx]}")
    print(f"Answer Correctness Score: {score}")
    print("-" * 50)

print("Answer Correctness Scores:")
print(correctness_scores)