
Evaluating the Quality of RAG Nodes

Evaluating the quality of Retrieval-Augmented Generation (RAG) nodes is crucial to ensure that your application delivers accurate and contextually relevant responses. By assessing the performance of these nodes, you can identify areas for improvement and optimize your workflows for better user experiences.

Importance of Evaluation

  1. Accuracy: Ensures that the generated responses are correct and align with the user's queries or the provided ground truth.

  2. Relevance: Measures how well the responses address the user's search queries, enhancing the application's usefulness.

  3. Optimization: Helps in fine-tuning the workflow components, leading to more efficient and effective RAG applications.

Using LLMs for Evaluation

Dynamiq provides tools to evaluate workflow responses using Large Language Models (LLMs) as judges. This approach leverages the reasoning capabilities of LLMs to assess the relevance and correctness of generated responses.

Example Code: Evaluating Relevance

The following example demonstrates how to evaluate the relevance of workflow responses using an LLM:

from dynamiq.components.evaluators.llm_evaluator import LLMEvaluator
from dynamiq.nodes.llms import BaseLLM, OpenAI
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

def run_relevance_to_search_query(llm: BaseLLM):
    instruction_text = """
    Evaluate the relevance of the "Answer" to the "Search Query".
    - Score the relevance from 0 to 1.
    - Use 1 if the Answer directly addresses the Search Query.
    - Use 0 if the Answer is irrelevant to the Search Query.
    """
    
    evaluator = LLMEvaluator(
        instructions=instruction_text.strip(),
        inputs=[
            ("search_queries", list[str]),
            ("answers", list[str]),
        ],
        outputs=["relevance_score"],
        examples=[
            {
                "inputs": {
                    "search_queries": ["Best Italian restaurants in New York"],
                    "answers": ["Here are the top-rated Italian restaurants in New York City..."],
                },
                "outputs": {"relevance_score": 1},
            },
            {
                "inputs": {
                    "search_queries": ["Weather forecast for tomorrow"],
                    "answers": ["Apple released a new iPhone model today."],
                },
                "outputs": {"relevance_score": 0},
            },
        ],
        llm=llm,
    )

    search_queries = [
        "How to bake a chocolate cake?",
        "What is the capital of France?",
        "Latest news on technology.",
    ]

    answers = [
        "To bake a chocolate cake, you need the following ingredients...",
        "The capital of France is Paris.",
        "The weather today is sunny with a chance of rain.",
    ]

    results = evaluator.run(search_queries=search_queries, answers=answers)
    return results

# Example usage with an OpenAI LLM:
if __name__ == "__main__":
    llm = OpenAI(model="gpt-4o-mini")
    relevance_results = run_relevance_to_search_query(llm)
    print("Answer Relevance to Search Query Results:")
    print(relevance_results)

# Output: Answer Relevance to Search Query Results: {'results': [{'relevance_score': 1}, {'relevance_score': 1}, {'relevance_score': 0}]}
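The evaluator returns a dictionary with a results list containing one entry per query-answer pair, as shown in the output above. Downstream code can aggregate these per-item scores; the following is a minimal sketch based on that output shape (the summarize_relevance helper is illustrative, not part of the Dynamiq API):

```python
def summarize_relevance(results: dict) -> dict:
    """Aggregate per-item relevance scores from an evaluator result dict."""
    scores = [item["relevance_score"] for item in results["results"]]
    return {
        "total": len(scores),
        "relevant": sum(1 for s in scores if s == 1),
        # Mean score rounded for readability; 0.0 guards against empty input.
        "mean_score": round(sum(scores) / len(scores), 2) if scores else 0.0,
    }

# Using the example output shown above:
example = {"results": [{"relevance_score": 1}, {"relevance_score": 1}, {"relevance_score": 0}]}
print(summarize_relevance(example))
# {'total': 3, 'relevant': 2, 'mean_score': 0.67}
```

A summary like this makes it easy to track relevance across evaluation runs, for example as a regression metric when you change retrieval settings.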

Example Code: Evaluating Correctness

Correctness can be evaluated by comparing generated responses against a known ground truth, using the same LLMEvaluator pattern as the relevance example above.
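A minimal sketch of such a correctness evaluator follows. It mirrors the relevance example; the input name ground_truth_answers and the output name correctness_score are illustrative choices, not fixed by the Dynamiq API:

```python
from dynamiq.components.evaluators.llm_evaluator import LLMEvaluator
from dynamiq.nodes.llms import BaseLLM, OpenAI
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

def run_correctness(llm: BaseLLM):
    instruction_text = """
    Evaluate the correctness of the "Answer" against the "Ground Truth Answer".
    - Score the correctness from 0 to 1.
    - Use 1 if the Answer conveys the same facts as the Ground Truth Answer.
    - Use 0 if the Answer contradicts the Ground Truth Answer.
    """

    evaluator = LLMEvaluator(
        instructions=instruction_text.strip(),
        inputs=[
            ("answers", list[str]),
            ("ground_truth_answers", list[str]),
        ],
        outputs=["correctness_score"],
        examples=[
            {
                "inputs": {
                    "answers": ["The capital of France is Paris."],
                    "ground_truth_answers": ["Paris is the capital of France."],
                },
                "outputs": {"correctness_score": 1},
            },
            {
                "inputs": {
                    "answers": ["The capital of France is Lyon."],
                    "ground_truth_answers": ["Paris is the capital of France."],
                },
                "outputs": {"correctness_score": 0},
            },
        ],
        llm=llm,
    )

    answers = [
        "The Eiffel Tower is located in Paris.",
        "Water boils at 90 degrees Celsius at sea level.",
    ]
    ground_truth_answers = [
        "The Eiffel Tower is in Paris, France.",
        "Water boils at 100 degrees Celsius at sea level.",
    ]

    results = evaluator.run(answers=answers, ground_truth_answers=ground_truth_answers)
    return results

if __name__ == "__main__":
    llm = OpenAI(model="gpt-4o-mini")
    correctness_results = run_correctness(llm)
    print("Answer Correctness Results:")
    print(correctness_results)
```

As with the relevance evaluator, the few-shot examples anchor the judge's scoring scale, so the first answer above should score 1 and the second 0.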

By implementing these evaluation techniques, you can ensure that your RAG nodes are performing optimally, providing users with accurate and relevant information.
