
Factual Correctness


Factual Correctness Metric

Factual Correctness measures the factual accuracy of a generated response compared to a reference response. It evaluates how well the generated content aligns with the factual information from the reference.

Key Features

  • Score Range: The factual correctness score ranges from 0 to 1, where higher values indicate better performance.

  • Claim Breakdown: The metric uses a large language model (LLM) to break both the generated response and the reference down into individual claims.

  • Natural Language Inference: It employs natural language inference (NLI) to assess the factual overlap between the claims of the generated response and those of the reference; a simplified sketch of this claim-matching step follows this list.
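
To make the claim-based comparison concrete, here is a minimal sketch using hand-written claims and plain set overlap. It is purely illustrative: the real evaluator extracts claims with an LLM and matches them with natural language inference rather than exact string comparison, so the claim lists and matching rule below are assumptions made for the example.

# Illustrative only: hand-written claims and exact-match comparison stand in
# for the LLM-based claim extraction and NLI matching the metric actually uses.
response_claims = {
    "Albert Einstein was a German theoretical physicist.",
    "Einstein developed the theory of relativity.",
    "Einstein contributed to quantum mechanics.",
}
reference_claims = {
    "Albert Einstein was a German-born theoretical physicist.",
    "Einstein developed the theory of relativity.",
}

tp = len(response_claims & reference_claims)  # response claims supported by the reference
fp = len(response_claims - reference_claims)  # response claims not supported by the reference
fn = len(reference_claims - response_claims)  # reference claims missing from the response

print(tp, fp, fn)  # 1 2 1 (NLI would also match the "German" vs "German-born" claims; exact matching does not)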

Alignment Measurement

The factual overlap is quantified using the following metrics (a short worked example follows the list):

  • Precision: The ratio of true positives (TP) to the sum of true positives and false positives (FP).

  • Recall: The ratio of true positives to the sum of true positives and false negatives (FN).

  • F1 Score: The harmonic mean of precision and recall, providing a single score to balance both metrics.
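
As a worked example with assumed claim counts (illustrative numbers, not evaluator output), suppose claim matching yields TP = 3, FP = 1, and FN = 2:

# Assumed counts for illustration only
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 5 = 0.60
f1 = 2 * precision * recall / (precision + recall)  # 0.9 / 1.35 ≈ 0.667

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.6 0.667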

Mode Parameter

The accuracy of the alignment can be controlled using the mode parameter, allowing for adjustments based on the specific requirements of the evaluation.
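
The exact mode options are not listed on this page; as a minimal sketch, the snippet below assumes the mode is passed when the evaluator is constructed and that it accepts values such as "precision", "recall", or "f1" (names borrowed from similar claim-based metrics). Check the FactualCorrectnessEvaluator API reference for the supported values.

from dynamiq.evaluations.metrics import FactualCorrectnessEvaluator
from dynamiq.nodes.llms import OpenAI

llm = OpenAI(model="gpt-4o-mini")

# Assumption: `mode` selects which alignment score is reported
# ("precision", "recall", or "f1"); verify against the current API docs.
precision_evaluator = FactualCorrectnessEvaluator(llm=llm, mode="precision")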

Summary

The Factual Correctness metric is essential for assessing the reliability and accuracy of generated responses, ensuring alignment with factual information.

Example Code: Factual Correctness Evaluation

This example demonstrates how to compute the Factual Correctness metric using the FactualCorrectnessEvaluator with the OpenAI language model.

import logging
import sys
from dotenv import find_dotenv, load_dotenv
from dynamiq.evaluations.metrics import FactualCorrectnessEvaluator
from dynamiq.nodes.llms import OpenAI

# Load environment variables for the OpenAI API
load_dotenv(find_dotenv())

# Configure logging level
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Initialize the OpenAI language model
llm = OpenAI(model="gpt-4o-mini")

# Sample data: each answer is scored against the reference context at the same index
answers = [
    (
        "Albert Einstein was a German theoretical physicist. "
        "He developed the theory of relativity and contributed "
        "to quantum mechanics."
    ),
    ("The Eiffel Tower is located in Berlin, Germany. " "It was constructed in 1889."),
]
contexts = [
    ("Albert Einstein was a German-born theoretical physicist. " "He developed the theory of relativity."),
    ("The Eiffel Tower is located in Paris, France. " "It was constructed in 1887 and opened in 1889."),
]

# Initialize evaluator and evaluate
evaluator = FactualCorrectnessEvaluator(llm=llm)
correctness_scores = evaluator.run(
    answers=answers, 
    contexts=contexts, 
    verbose=True  # Set to False to disable verbose logging
)

# Print the results
for idx, score in enumerate(correctness_scores):
    print(f"Answer: {answers[idx]}")
    print(f"Factual Correctness Score: {score}")
    print("-" * 50)

print("Factual Correctness Scores:")
print(correctness_scores)

Formally, the claim-level counts and the scores reported by the metric are defined as:

$$\text{TP} = \text{number of claims in the response that are present in the reference}$$

$$\text{FP} = \text{number of claims in the response that are not present in the reference}$$

$$\text{FN} = \text{number of claims in the reference that are not present in the response}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$