Evaluations

Evaluating the Quality of RAG Nodes

Evaluating the quality of Retrieval-Augmented Generation (RAG) nodes is crucial to ensure that your application delivers accurate and contextually relevant responses. By assessing the performance of these nodes, you can identify areas for improvement and optimize your workflows for better user experiences.

Importance of Evaluation

  1. Accuracy: Ensures that the generated responses are correct and align with the user's queries or the provided ground truth.

  2. Relevance: Measures how well the responses address the user's search queries, enhancing the application's usefulness.

  3. Optimization: Helps in fine-tuning the workflow components, leading to more efficient and effective RAG applications.

Using LLMs for Evaluation

Dynamiq provides tools to evaluate workflow responses using Large Language Models (LLMs) as judges. This approach leverages the capabilities of LLMs to assess the relevance and correctness of responses.

Example Code: Evaluating Relevance

The following example demonstrates how to evaluate the relevance of workflow responses using an LLM:

from dynamiq.components.evaluators.llm_evaluator import LLMEvaluator
from dynamiq.nodes.llms import BaseLLM, OpenAI
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

def run_relevance_to_search_query(llm: BaseLLM):
    instruction_text = """
    Evaluate the relevance of the "Answer" to the "Search Query".
    - Score the relevance from 0 to 1.
    - Use 1 if the Answer directly addresses the Search Query.
    - Use 0 if the Answer is irrelevant to the Search Query.
    - Provide a brief justification for the score.
    """
    
    # "inputs" declares the names and types of the data passed to run(), "outputs"
    # names the fields the judge LLM must return, and "examples" provides few-shot
    # demonstrations that anchor the 0-1 scoring scale.
    evaluator = LLMEvaluator(
        instructions=instruction_text.strip(),
        inputs=[
            ("search_queries", list[str]),
            ("answers", list[str]),
        ],
        outputs=["relevance_score"],
        examples=[
            {
                "inputs": {
                    "search_queries": "Best Italian restaurants in New York",
                    "answers": "Here are the top-rated Italian restaurants in New York City...",
                },
                "outputs": {"relevance_score": 1},
            },
            {
                "inputs": {
                    "search_queries": "Weather forecast for tomorrow",
                    "answers": "Apple released a new iPhone model today.",
                },
                "outputs": {"relevance_score": 0},
            },
        ],
        llm=llm,
    )

    search_queries = [
        "How to bake a chocolate cake?",
        "What is the capital of France?",
        "Latest news on technology.",
    ]

    answers = [
        "To bake a chocolate cake, you need the following ingredients...",
        "The capital of France is Paris.",
        "The weather today is sunny with a chance of rain.",
    ]

    # Keyword arguments must match the declared input names; the evaluator returns
    # one score per query/answer pair, in input order.
    results = evaluator.run(search_queries=search_queries, answers=answers)
    return results

# Example usage with an OpenAI LLM:
if __name__ == "__main__":
    llm = OpenAI(model="gpt-4o-mini")
    relevance_results = run_relevance_to_search_query(llm)
    print("Answer Relevance to Search Query Results:")
    print(relevance_results)

# Output: Answer Relevance to Search Query Results: {'results': [{'relevance_score': 1}, {'relevance_score': 1}, {'relevance_score': 0}]}
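
The evaluator returns a dictionary with a "results" list that contains one entry per input item, in the same order as the inputs. If you need a single number for the whole batch, you can aggregate these per-item scores yourself. The snippet below is a minimal sketch of such an aggregation; the summarize_relevance helper is illustrative and not part of Dynamiq.

def summarize_relevance(results: dict) -> dict:
    """Aggregate per-item relevance scores into simple batch-level statistics."""
    scores = [item["relevance_score"] for item in results["results"]]
    return {
        "num_items": len(scores),
        "num_relevant": sum(1 for score in scores if score == 1),
        "mean_relevance": sum(scores) / len(scores) if scores else 0.0,
    }

# For the output shown above, this yields:
# {'num_items': 3, 'num_relevant': 2, 'mean_relevance': 0.6666666666666666}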

Example Code: Evaluating Correctness

This example shows how to evaluate the correctness of responses by comparing them to a ground truth:

from dynamiq.components.evaluators.llm_evaluator import LLMEvaluator
from dynamiq.nodes.llms import BaseLLM, OpenAI
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

def run_correctness_comparing_to_ground_truth(llm: BaseLLM):
    instruction_text = """
    Evaluate the correctness of the "Answer" by comparing it to the "Ground Truth".
    - Score the correctness from 0 to 1.
    - Use 1 if the Answer is correct and matches the Ground Truth.
    - Use 0 if the Answer is incorrect or contradicts the Ground Truth.
    - Provide a brief explanation for the score.
    """

    # Same evaluator pattern as above; here the judge compares each answer to its
    # ground-truth statement instead of to a search query.
    evaluator = LLMEvaluator(
        instructions=instruction_text.strip(),
        inputs=[
            ("answers", list[str]),
            ("ground_truth", list[str]),
        ],
        outputs=["correctness_score"],
        examples=[
            {
                "inputs": {
                    "answers": "The capital of France is Paris.",
                    "ground_truth": "Paris is the capital of France.",
                },
                "outputs": {"correctness_score": 1},
            },
            {
                "inputs": {
                    "answers": "The capital of France is Berlin.",
                    "ground_truth": "Paris is the capital of France.",
                },
                "outputs": {"correctness_score": 0},
            },
        ],
        llm=llm,
    )

    answers = [
        "The capital of Germany is Berlin.",
        "Einstein developed the theory of gravity.",
        "The Great Wall is located in China.",
    ]

    ground_truth = [
        "Berlin is the capital of Germany.",
        "Newton developed the theory of gravity.",
        "The Great Wall of China is located in China.",
    ]

    results = evaluator.run(answers=answers, ground_truth=ground_truth)
    return results

# Example usage with an OpenAI LLM:
if __name__ == "__main__":
    llm = OpenAI(model="gpt-4o-mini")
    correctness_results = run_correctness_comparing_to_ground_truth(llm)
    print("\nAnswer Correctness Comparing to Ground Truth Results:")
    print(correctness_results)

# Output: Answer Correctness Comparing to Ground Truth Results: {'results': [{'correctness_score': 1}, {'correctness_score': 0}, {'correctness_score': 1}]}
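
Because both evaluators return their scores in the same order as their inputs, you can run them over the same ordered list of answers and combine the per-item results, for example to flag answers that look relevant but contradict the ground truth. The pair_scores helper below is a minimal sketch under that assumption and is not a Dynamiq API.

def pair_scores(relevance_results: dict, correctness_results: dict) -> list[dict]:
    """Zip per-item relevance and correctness scores for the same ordered answers."""
    paired = []
    for relevance, correctness in zip(
        relevance_results["results"], correctness_results["results"]
    ):
        paired.append(
            {
                "relevance_score": relevance["relevance_score"],
                "correctness_score": correctness["correctness_score"],
                # Relevant but factually wrong answers are usually the ones to review first.
                "needs_review": relevance["relevance_score"] == 1
                and correctness["correctness_score"] == 0,
            }
        )
    return paired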

By applying these evaluation techniques, you can verify that your RAG nodes return accurate and relevant responses and identify where your workflow needs improvement.
