Factual Correctness
Factual Correctness measures the factual accuracy of a generated response against a reference response, evaluating how well the generated content aligns with the factual information in the reference.
- **Score Range:** The factual correctness score ranges from 0 to 1, where higher values indicate better performance.
- **Claim Breakdown:** The metric breaks down both the generated response and the reference into individual claims using a large language model (LLM).
- **Natural Language Inference:** It then employs natural language inference (NLI) to assess the factual overlap between the generated response and the reference (a minimal sketch of these two steps follows this list).
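To make these two steps concrete, here is a minimal sketch of the claim-level pipeline. `decompose_into_claims` and `nli_supports` are hypothetical stand-ins for the LLM-backed steps, not part of any specific library:

```python
# Sketch of the claim-level pipeline (hypothetical helpers standing in
# for the LLM-backed decomposition and NLI steps described above).

def decompose_into_claims(text: str) -> list[str]:
    """Hypothetical: prompt an LLM to split `text` into atomic claims."""
    raise NotImplementedError

def nli_supports(premise: str, claim: str) -> bool:
    """Hypothetical: NLI check of whether `premise` entails `claim`."""
    raise NotImplementedError

def claim_verdicts(response: str, reference: str) -> tuple[int, int, int]:
    """Count claim-level agreement between a response and a reference."""
    response_claims = decompose_into_claims(response)
    reference_claims = decompose_into_claims(reference)

    # TP: response claims supported by the reference.
    tp = sum(nli_supports(reference, c) for c in response_claims)
    # FP: response claims not supported by the reference.
    fp = len(response_claims) - tp
    # FN: reference claims missing from the response.
    fn = sum(not nli_supports(response, c) for c in reference_claims)
    return tp, fp, fn
```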
The factual overlap is quantified using the following metrics:
- **Precision:** The ratio of true positives (TP) to the sum of true positives and false positives (FP).
- **Recall:** The ratio of true positives to the sum of true positives and false negatives (FN).
- **F1 Score:** The harmonic mean of precision and recall, providing a single score that balances both metrics.
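Restated in symbols, these definitions are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

For example, with illustrative counts of TP = 4, FP = 1, and FN = 2: precision = 4/5 = 0.8, recall = 4/6 ≈ 0.667, and F1 ≈ 0.727.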
Which of these scores the metric reports can be controlled using the `mode` parameter, allowing the evaluation to be adjusted to specific requirements.
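As an illustration of how such a mode switch might combine the claim counts into a single score (a sketch; the mode names "precision", "recall", and "f1" are assumptions, not a confirmed API):

```python
def factual_correctness(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Combine claim counts into one score, selected by `mode`.
    The mode names here are assumptions; check the library's docs."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    # Default: F1, the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(factual_correctness(4, 1, 2))  # ~0.727, matching the worked example above
```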
The Factual Correctness metric is essential for assessing the reliability of generated responses, ensuring they remain grounded in the factual information of the reference.
This example demonstrates how to compute the Factual Correctness metric using the `FactualCorrectnessEvaluator` with the OpenAI language model.
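A minimal sketch of what that example might look like, assuming `FactualCorrectnessEvaluator` accepts a model name and `mode` in its constructor and exposes an `evaluate` method; the import path and exact signature below are assumptions, so consult the library's documentation for the real API:

```python
# Sketch only: import path, constructor arguments, and method names are
# assumptions based on the description above, not a confirmed API.
from factual_eval import FactualCorrectnessEvaluator  # hypothetical import path

evaluator = FactualCorrectnessEvaluator(
    model="gpt-4o",  # assumed: OpenAI model used for claim extraction and NLI
    mode="f1",       # assumed: selects precision, recall, or F1 (see above)
)

score = evaluator.evaluate(
    response="Albert Einstein was born in Germany in 1879.",
    reference="Albert Einstein was born on 14 March 1879 in Ulm, Germany.",
)

print(f"Factual correctness: {score:.3f}")  # a value in [0, 1]; higher is better
```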