LlamaIndex Evaluation
This section demonstrates how to evaluate a LlamaIndex pipeline using Mistral AI and BeyondLLM. We'll walk through the process step-by-step, explaining each component and its purpose.
Setup and Imports
First, let's import the necessary libraries and set up our environment:
import os
from getpass import getpass

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
import chromadb

from beyondllm.utils import CONTEXT_RELEVENCE, GROUNDEDNESS, ANSWER_RELEVENCE

import re
import numpy as np
import pysbd

# Set up Hugging Face API token
HUGGINGFACEHUB_API_TOKEN = getpass("API:")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
This code sets up the necessary imports and securely prompts for the Hugging Face API token.
Document Loading and Model Configuration
Next, we'll load our documents and configure the embedding and language models:
# Load documents
documents = SimpleDirectoryReader("/content/sample_data/Data").load_data()

# Configure embedding and language models
embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    token=HUGGINGFACEHUB_API_TOKEN
)
Here, we load documents from a local directory and configure our embedding model (FastEmbedEmbedding with gte-large) and language model (Mistral-7B-Instruct served through the Hugging Face Inference API).
Vector Store and Index Setup
Now, let's set up our vector store and create an index:
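The original snippet for this step is not shown, so the following is a minimal sketch of what it typically looks like. The ephemeral Chroma client, the collection name "rag_eval", and the use of Settings to register the models globally are assumptions rather than the author's exact code:

# Minimal sketch (assumed details): an in-memory Chroma client, a placeholder
# collection name, and global Settings so the query engine uses our models.
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

# Initialize Chroma and create a collection to hold the document embeddings
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("rag_eval")

# Wrap the collection in a LlamaIndex vector store and build the index
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model
)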
This code initializes a Chroma vector store, creates a collection, and builds an index from our documents using the configured embedding model.
Utility Functions
We'll define some utility functions to help with our evaluation:
def extract_number(text):
    # Pull the first numeric score (integer or decimal) out of the LLM's response
    match = re.search(r'\d+(\.\d+)?', text)
    return float(match.group()) if match else 0

def sent_tokenize(text):
    # Split text into sentences using pysbd's rule-based segmenter
    seg = pysbd.Segmenter(language="en", clean=False)
    return seg.segment(text)
These functions help extract numerical scores from text and tokenize sentences for evaluation.
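Evaluation Functions
The definitions of get_context_relevancy, get_answer_relevancy, and get_groundedness are not shown in the original, so the sketch below illustrates one plausible implementation built on the BeyondLLM prompt templates imported earlier. The placeholder names passed to .format() and the averaging logic are assumptions:

# Hedged sketch: the exact placeholders expected by the BeyondLLM templates
# (question / context / answer / statement) are assumed here.
def get_context_relevancy(llm, query, context):
    # Score each retrieved chunk's relevance to the query, then average
    scores = []
    for chunk in context:
        prompt = CONTEXT_RELEVENCE.format(question=query, context=chunk)
        scores.append(extract_number(llm.complete(prompt).text))
    return f"Context Relevancy Score: {round(np.mean(scores), 1)}"

def get_answer_relevancy(llm, query, answer):
    # Ask the LLM how relevant the generated answer is to the question
    prompt = ANSWER_RELEVENCE.format(question=query, answer=answer)
    return f"Answer Relevancy Score: {extract_number(llm.complete(prompt).text)}"

def get_groundedness(llm, query, context, answer):
    # Check, sentence by sentence, whether the answer is supported by the context
    statements = sent_tokenize(answer)
    scores = []
    for statement in statements:
        prompt = GROUNDEDNESS.format(statement=statement, context=" ".join(context))
        scores.append(extract_number(llm.complete(prompt).text))
    return f"Groundedness Score: {round(np.mean(scores), 1)}"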
These functions evaluate context relevancy, answer relevancy, and groundedness of the model's responses.
Evaluation Execution
Finally, let's execute our evaluation:
# Set up query engine
query_engine = index.as_query_engine()

# Example queries
queries = [
    "what doesnt cause heart diseases",
    "what is the capital of turkey"
]

for query in queries:
    print(f"\nQuery: {query}")
    retrieved_documents = query_engine.retrieve(query)
    context = [doc.node.text for doc in retrieved_documents]
    response = query_engine.query(query)
    print(get_context_relevancy(llm, query, context))
    print(get_answer_relevancy(llm, query, response.response))
    print(get_groundedness(llm, query, context, response.response))
This code sets up a query engine, defines example queries, and runs the evaluation for each query.
Sample Output
Here's a sample output of the evaluation:
Query: what doesnt cause heart diseases
Context Relevancy Score: 3.0
Answer Relevancy Score: 6
Groundedness Score: 6.0
Query: what is the capital of turkey
Context Relevancy Score: 0.0
Answer Relevancy Score: 0
Groundedness Score: 5.0
This output shows the context relevancy, answer relevancy, and groundedness scores for each query, indicating how well the pipeline retrieved relevant context, produced a relevant answer, and kept that answer grounded in the retrieved context. As expected, the off-topic query about the capital of Turkey receives near-zero relevancy scores, since the indexed documents concern heart disease rather than geography.