📊 Evaluation

The effectiveness of a RAG pipeline is assessed through four key evaluation benchmarks: Context Relevance, Answer Relevance, Groundedness, and Ground Truth. Each benchmark is scored on a scale from 0 to 10.
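
Because all four benchmarks share the same 0 to 10 scale, their scores can be compared side by side. The hypothetical helper below is not part of the library and uses arbitrarily chosen cut-offs; it only illustrates one way to turn such a score into a qualitative label.

# Hypothetical helper with arbitrary thresholds; adjust the cut-offs to your own needs.
def interpret_score(score: float) -> str:
    if score >= 8:
        return "strong"
    if score >= 5:
        return "acceptable"
    return "weak"

print(interpret_score(6.5))  # -> acceptable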

Context Relevance

Measures how relevant the chunks retrieved by the auto_retriever are to the user's query. This reflects how effectively the auto_retriever fetches contextually relevant information, ensuring a solid foundation for generating responses. A score from 0 (least relevant) to 10 (most relevant) evaluates the retriever's performance in sourcing relevant data.

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

# query, retriever, and llm are assumed to come from the earlier pipeline setup
pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_context_relevancy())
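
To make the intuition behind this metric concrete, the sketch below rates a list of retrieved chunks against a query using simple token overlap, scaled to 0-10. It is a rough, library-free illustration only; the toy_context_relevance function and its scoring rule are not part of the library, which computes the real score internally.

# Rough illustration only: rate each chunk by token overlap with the query and
# average the results on a 0-10 scale. Not the pipeline's actual scoring method.
def toy_context_relevance(query: str, chunks: list[str]) -> float:
    query_tokens = set(query.lower().split())
    if not query_tokens or not chunks:
        return 0.0
    overlaps = [len(query_tokens & set(chunk.lower().split())) / len(query_tokens)
                for chunk in chunks]
    return 10 * sum(overlaps) / len(overlaps)

print(toy_context_relevance("what is retrieval augmented generation",
                            ["Retrieval augmented generation combines search with LLMs."]))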

Answer Relevance

Evaluates the relevance of the LLM's response to the user query. It assesses the LLM's ability to generate useful and appropriate answers, reflecting its utility in practical scenarios. A score from 0 (irrelevant) to 10 (highly relevant) quantifies the relevance of responses to user queries.

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_answer_relevancy())
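
Metrics like this are often implemented by asking a judge model to grade the answer against the question. The sketch below only builds such a grading prompt as a plain string; it is an illustration of the idea, not the prompt the library uses internally.

# Illustrative LLM-as-a-judge prompt; NOT the library's internal prompt.
JUDGE_PROMPT = """Rate on a scale of 0 (irrelevant) to 10 (highly relevant)
how well the ANSWER addresses the QUESTION. Reply with a single number.

QUESTION: {question}
ANSWER: {answer}"""

def build_answer_relevance_prompt(question: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, answer=answer)

print(build_answer_relevance_prompt("What is RAG?",
                                    "RAG augments an LLM with retrieved documents."))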

Groundedness

Determines the extent to which the language model's responses are grounded in the information retrieved by the auto_retriever. By identifying hallucinated content, it ensures that the outputs are based on factual information. The response is divided into statements, which are then cross-referenced with the retrieved chunks and scored from 0 (completely hallucinated) to 10 (fully grounded).

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_groundedness())
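
The statement-splitting idea described above can be mimicked with plain Python. The sketch below splits a response into sentences and checks how many of them overlap with the retrieved chunks; it is a rough illustration only, and the toy_groundedness function and its 0.5 support threshold are not part of the library.

import re

# Rough illustration of the described approach: split the response into statements
# and count how many are supported by the retrieved chunks (token overlap).
def toy_groundedness(response: str, chunks: list[str], support_threshold: float = 0.5) -> float:
    statements = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not statements:
        return 0.0
    chunk_tokens = set(" ".join(chunks).lower().split())
    supported = 0
    for statement in statements:
        tokens = set(statement.lower().split())
        if tokens and len(tokens & chunk_tokens) / len(tokens) >= support_threshold:
            supported += 1  # enough of this statement appears in the retrieved context
    return 10 * supported / len(statements)

print(toy_groundedness("RAG retrieves documents. It was invented on Mars.",
                       ["RAG retrieves documents before generation."]))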

Ground Truth

Measures the alignment between the LLM's response and a predefined correct answer provided by the user. It evaluates the entire pipeline's ability to understand a query and produce the expected outcome, serving as a comprehensive, end-to-end benchmark of performance. A score from 0 to 10 reflects the degree of match between the response and the ground truth answer.

Parameters

  • User Query: The question or query for which the response is generated.

  • Ground Truth: The predefined correct answer to the user query passed above.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_ground_truth(ground_truth))  # ground_truth holds the reference answer
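
As a rough illustration of measuring agreement with a reference answer, the sketch below computes a token-level F1 overlap between the generated response and the ground truth, scaled to 0-10. The toy_ground_truth_score function is not part of the library, and the pipeline's own scoring may differ.

import re

# Rough illustration only: token-level F1 overlap scaled to 0-10.
def toy_ground_truth_score(response: str, ground_truth: str) -> float:
    resp = set(re.findall(r"[a-z0-9]+", response.lower()))
    truth = set(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    common = resp & truth
    if not common:
        return 0.0
    precision = len(common) / len(resp)
    recall = len(common) / len(truth)
    return 10 * 2 * precision * recall / (precision + recall)

print(toy_ground_truth_score("Paris is the capital of France.",
                             "The capital of France is Paris."))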

RAG Triad

Computes and returns the relevancy (Context and Answer) and groundedness scores for the response generated by the pipeline. This method calculates all three key evaluation metrics in a single call:

  • Context Relevancy

  • Answer Relevancy

  • Groundedness

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_rag_triad_evals())
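
When a reference answer is available, the triad can be combined with the ground-truth check in one pass. The sketch below uses only the methods documented above; because the exact return format may vary between versions, it simply prints the results.

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)

# Context relevancy, answer relevancy, and groundedness in a single call
print(pipeline.get_rag_triad_evals())

# Ground-truth check, assuming a reference answer is available in ground_truth
print(pipeline.get_ground_truth(ground_truth))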
