The Answer Correctness Evaluator assesses how well an answer aligns with its ground truth answer. It extracts key statements from both the answer and the ground truth, classifies them as True Positives (TP), False Positives (FP), and False Negatives (FN), and also computes a similarity score between the answer and the ground truth.
Key Formulas
Precision:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 Score:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Final Score (combined from the F1 score and the similarity score):

$$\text{Final Score} = w_1 \cdot F_1 + w_2 \cdot \text{Similarity}$$

where $w_1$ and $w_2$ are the weights assigned to the F1 and similarity scores, respectively.
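To make the arithmetic concrete, here is a minimal sketch that evaluates these formulas for hypothetical statement counts and a hypothetical similarity score. The counts, the similarity value, and the weights ($w_1 = 0.75$, $w_2 = 0.25$) are illustrative assumptions, not values produced or used by the AnswerCorrectnessEvaluator.

```python
# Illustrative sketch only: hypothetical counts and weights,
# not the evaluator's internal implementation.
tp, fp, fn = 3, 1, 2   # classified key statements (hypothetical counts)
similarity = 0.85      # answer/ground-truth similarity score (hypothetical)
w1, w2 = 0.75, 0.25    # illustrative weights, not Dynamiq defaults

precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60
f1 = 2 * precision * recall / (precision + recall)  # ~0.667
final_score = w1 * f1 + w2 * similarity             # ~0.71

print(f"Precision={precision:.2f}, Recall={recall:.2f}, "
      f"F1={f1:.2f}, Final={final_score:.2f}")
```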
Example Code: Answer Correctness Evaluation
This example demonstrates how to compute the Answer Correctness metric using the AnswerCorrectnessEvaluator with an OpenAI language model (gpt-4o-mini in this example).
```python
import logging
import sys

from dotenv import find_dotenv, load_dotenv

from dynamiq.evaluations.metrics import AnswerCorrectnessEvaluator
from dynamiq.nodes.llms import OpenAI

# Load environment variables for the OpenAI API
load_dotenv(find_dotenv())

# Configure logging level
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Initialize the OpenAI language model
llm = OpenAI(model="gpt-4o-mini")

# Sample data
questions = [
    "What powers the sun and what is its primary function?",
    "What is the boiling point of water?",
]
answers = [
    (
        "The sun is powered by nuclear fission, similar to nuclear reactors on Earth."
        " Its primary function is to provide light to the solar system."
    ),
    "The boiling point of water is 100 degrees Celsius at sea level.",
]
ground_truth_answers = [
    (
        "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium."
        " This fusion process releases a tremendous amount of energy. The sun provides"
        " heat and light, which are essential for life on Earth."
    ),
    (
        "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at"
        " sea level. The boiling point can change with altitude."
    ),
]

# Initialize the evaluator
evaluator = AnswerCorrectnessEvaluator(llm=llm)

# Evaluate
correctness_scores = evaluator.run(
    questions=questions,
    answers=answers,
    ground_truth_answers=ground_truth_answers,
    verbose=False,  # Set verbose=True to enable logging
)

# Print the results
for idx, score in enumerate(correctness_scores):
    print(f"Question: {questions[idx]}")
    print(f"Answer Correctness Score: {score}")
    print("-" * 50)

print("Answer Correctness Scores:")
print(correctness_scores)
```
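Assuming the evaluator returns one score per question on a 0-to-1 scale, the first answer (which attributes the sun's energy to nuclear fission rather than fusion and omits its role in providing heat) should score noticeably lower than the second answer, which closely matches its ground truth.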