📊 Evaluation

The effectiveness of a RAG pipeline is assessed through four key evaluation benchmarks: Context Relevance, Answer Relevance, Groundedness, and Ground Truth. Each benchmark is scored on a scale from 0 to 10.
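
Because all four benchmarks share the same 0 to 10 scale, their scores can be compared side by side. The hypothetical helper below is not part of the library and uses arbitrarily chosen cut-offs; it only illustrates one way to turn such a score into a qualitative label.

# Hypothetical helper with arbitrary thresholds; adjust the cut-offs to your own needs.
def interpret_score(score: float) -> str:
    if score >= 8:
        return "strong"
    if score >= 5:
        return "acceptable"
    return "weak"

print(interpret_score(6.5))  # -> acceptable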

Context Relevance

Measures how relevant the chunks retrieved by the auto_retriever are to the user's query. This reflects how effectively the auto_retriever fetches contextually relevant information, ensuring a solid foundation for generating responses. A score from 0 (least relevant) to 10 (most relevant) evaluates the retriever's performance in sourcing relevant data.

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

# query, retriever, and llm are assumed to come from the earlier pipeline setup
pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_context_relevancy())
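
To make the intuition behind this metric concrete, the sketch below rates a list of retrieved chunks against a query using simple token overlap, scaled to 0-10. It is a rough, library-free illustration only; the toy_context_relevance function and its scoring rule are not part of the library, which computes the real score internally.

# Rough illustration only: rate each chunk by token overlap with the query and
# average the results on a 0-10 scale. Not the pipeline's actual scoring method.
def toy_context_relevance(query: str, chunks: list[str]) -> float:
    query_tokens = set(query.lower().split())
    if not query_tokens or not chunks:
        return 0.0
    overlaps = [len(query_tokens & set(chunk.lower().split())) / len(query_tokens)
                for chunk in chunks]
    return 10 * sum(overlaps) / len(overlaps)

print(toy_context_relevance("what is retrieval augmented generation",
                            ["Retrieval augmented generation combines search with LLMs."]))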

Answer Relevance

Evaluates the relevance of the LLM's response to the user query. It assesses the LLM's ability to generate useful and appropriate answers, reflecting its utility in practical scenarios. A score from 0 (irrelevant) to 10 (highly relevant) quantifies the relevance of responses to user queries.

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_answer_relevancy())
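
Metrics like this are often implemented by asking a judge model to grade the answer against the question. The sketch below only builds such a grading prompt as a plain string; it is an illustration of the idea, not the prompt the library uses internally.

# Illustrative LLM-as-a-judge prompt; NOT the library's internal prompt.
JUDGE_PROMPT = """Rate on a scale of 0 (irrelevant) to 10 (highly relevant)
how well the ANSWER addresses the QUESTION. Reply with a single number.

QUESTION: {question}
ANSWER: {answer}"""

def build_answer_relevance_prompt(question: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, answer=answer)

print(build_answer_relevance_prompt("What is RAG?",
                                    "RAG augments an LLM with retrieved documents."))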

Groundedness

Determines the extent to which the language model's responses are grounded in the information retrieved by the auto_retriever. By identifying hallucinated content, it ensures that the outputs are based on factual information. The response is divided into statements, which are then cross-referenced with the retrieved chunks and scored from 0 (completely hallucinated) to 10 (fully grounded).

Parameters

  • User Query: The question or query for which the response is generated.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_groundedness())
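
The statement-splitting idea described above can be mimicked with plain Python. The sketch below splits a response into sentences and checks how many of them overlap with the retrieved chunks; it is a rough illustration only, and the toy_groundedness function and its 0.5 support threshold are not part of the library.

import re

# Rough illustration of the described approach: split the response into statements
# and count how many are supported by the retrieved chunks (token overlap).
def toy_groundedness(response: str, chunks: list[str], support_threshold: float = 0.5) -> float:
    statements = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not statements:
        return 0.0
    chunk_tokens = set(" ".join(chunks).lower().split())
    supported = 0
    for statement in statements:
        tokens = set(statement.lower().split())
        if tokens and len(tokens & chunk_tokens) / len(tokens) >= support_threshold:
            supported += 1  # enough of this statement appears in the retrieved context
    return 10 * supported / len(statements)

print(toy_groundedness("RAG retrieves documents. It was invented on Mars.",
                       ["RAG retrieves documents before generation."]))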

Ground Truth

Measures the alignment between the LLM's response and a predefined correct answer provided by the user. It evaluates the entire pipeline's ability to understand a query and produce the expected outcome, serving as a comprehensive, end-to-end benchmark of performance. A score from 0 to 10 reflects the degree of match between the response and the ground truth answer.

Parameters

  • User Query: The question or query for which the response is generated.

  • Ground Truth: The predefined correct answer to the user query passed above.

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_ground_truth(ground_truth))  # ground_truth holds the reference answer
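
As a rough illustration of measuring agreement with a reference answer, the sketch below computes a token-level F1 overlap between the generated response and the ground truth, scaled to 0-10. The toy_ground_truth_score function is not part of the library, and the pipeline's own scoring may differ.

import re

# Rough illustration only: token-level F1 overlap scaled to 0-10.
def toy_ground_truth_score(response: str, ground_truth: str) -> float:
    resp = set(re.findall(r"[a-z0-9]+", response.lower()))
    truth = set(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    common = resp & truth
    if not common:
        return 0.0
    precision = len(common) / len(resp)
    recall = len(common) / len(truth)
    return 10 * 2 * precision * recall / (precision + recall)

print(toy_ground_truth_score("Paris is the capital of France.",
                             "The capital of France is Paris."))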

RAG Triad

Computes and returns the relevancy (Context and Answer) and groundedness scores for the response generated by the pipeline. This method calculates all three key evaluation metrics in a single call:

  • Context Relevancy

  • Answer Relevancy

  • Groundedness

Code snippet

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)
print(pipeline.get_rag_triad_evals())
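
When a reference answer is available, the triad can be combined with the ground-truth check in one pass. The sketch below uses only the methods documented above; because the exact return format may vary between versions, it simply prints the results.

pipeline = generator.Generate(question=query, retriever=retriever, llm=llm)

# Context relevancy, answer relevancy, and groundedness in a single call
print(pipeline.get_rag_triad_evals())

# Ground-truth check, assuming a reference answer is available in ground_truth
print(pipeline.get_ground_truth(ground_truth))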
