🦙 LlamaIndex
LlamaIndex Evaluation
This section demonstrates how to evaluate a LlamaIndex pipeline using Mistral AI and BeyondLLM. We'll walk through the process step by step, explaining each component and its purpose.
Setup and Imports
First, let's import the necessary libraries and set up our environment:
import os
from getpass import getpass
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
import chromadb
from beyondllm.utils import CONTEXT_RELEVENCE, GROUNDEDNESS, ANSWER_RELEVENCE
import re
import numpy as np
import pysbd
# Set up Hugging Face API Token
HUGGINGFACEHUB_API_TOKEN = getpass("API:")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

This code sets up the necessary imports and securely prompts for the Hugging Face API token.
Document Loading and Model Configuration
Next, we'll load our documents and configure the embedding and language models:
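A minimal sketch of this step is shown below; the ./data directory and the Mistral model id are placeholder assumptions that you may need to adjust for your own setup:

# Load documents from a local directory (path is a placeholder)
documents = SimpleDirectoryReader("./data").load_data()

# Embedding model: FastEmbed with a small BGE model
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Language model: Mistral AI served through the Hugging Face Inference API
llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    token=HUGGINGFACEHUB_API_TOKEN,
)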
Here, we load documents from a specified directory and set up our embedding model (FastEmbedEmbedding) and language model (Mistral AI via the Hugging Face Inference API).
Vector Store and Index Setup
Now, let's set up our vector store and create an index:
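A sketch along these lines, with a placeholder persistence path and collection name, could look like this:

# Create a persistent Chroma client and a collection for the documents
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("llamaindex_eval")

# Wrap the collection in a LlamaIndex vector store and storage context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index from the loaded documents using the FastEmbed embedding model
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)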
This code initializes a Chroma vector store, creates a collection, and builds an index from our documents using the configured embedding model.
Utility Functions
We'll define some utility functions to help with our evaluation:
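For example, a score extractor and a sentence tokenizer, sketched here with a simple regular expression and pysbd, might look like this:

def extract_number(text):
    # Return the first number found in a model reply such as "Score: 8"
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0

def sentence_tokenize(text):
    # Split text into sentences using pysbd
    segmenter = pysbd.Segmenter(language="en", clean=False)
    return segmenter.segment(text)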
These functions help extract numerical scores from text and tokenize sentences for evaluation.
Evaluation Functions
Now, let's implement our evaluation functions:
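The sketch below sends each BeyondLLM prompt template to the LLM and parses a numeric score from the reply; the .format(...) field names (question, context, statement) are assumptions about the templates and may need adjusting:

def evaluate_context_relevancy(query, contexts):
    # Assumed template fields: question, context
    prompt = CONTEXT_RELEVENCE.format(question=query, context="\n".join(contexts))
    return extract_number(str(llm.complete(prompt)))

def evaluate_answer_relevancy(query, answer):
    # Assumed template fields: question, context
    prompt = ANSWER_RELEVENCE.format(question=query, context=answer)
    return extract_number(str(llm.complete(prompt)))

def evaluate_groundedness(answer, contexts):
    # Score each sentence of the answer against the retrieved context, then average
    scores = []
    for statement in sentence_tokenize(answer):
        prompt = GROUNDEDNESS.format(statement=statement, context="\n".join(contexts))
        scores.append(extract_number(str(llm.complete(prompt))))
    return float(np.mean(scores)) if scores else 0.0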
These functions evaluate context relevancy, answer relevancy, and groundedness of the model's responses.
Evaluation Execution
Finally, let's execute our evaluation:
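A sketch of the evaluation loop is shown below; the two example queries are placeholders for questions about your own documents:

# Build a query engine on top of the index, answering with the Mistral model
query_engine = index.as_query_engine(llm=llm)

# Placeholder example queries
queries = [
    "What is the main topic of the documents?",
    "Summarize the key findings.",
]

for query in queries:
    response = query_engine.query(query)
    contexts = [node.get_content() for node in response.source_nodes]
    answer = str(response)
    print(f"Query: {query}")
    print("Context Relevancy:", evaluate_context_relevancy(query, contexts))
    print("Answer Relevancy:", evaluate_answer_relevancy(query, answer))
    print("Groundedness:", evaluate_groundedness(answer, contexts))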
This code sets up a query engine, defines example queries, and runs the evaluation for each query.
Sample Output
A sample run prints, for each query, the evaluation scores for context relevancy, answer relevancy, and groundedness. These scores indicate how well the model performed in retrieving relevant context, providing relevant answers, and ensuring the answers are grounded in the provided context.