🤖Auto Retriever
What is Auto Retriever?
Retrievers are essential components in BeyondLLM, responsible for efficiently fetching relevant information from the knowledge base based on user queries. They utilize the generated embeddings to perform similarity search and identify the most pertinent documents or passages.
We call this function the auto retriever because it abstracts away the complexity and lets you define your retrieval type and reranker in one line. The auto_retriever function from beyondllm.retrieve lets you set your retriever model. The .retrieve("<your-text-here>") method can then be used to get the list of NodeWithScore objects returned for your query.
The auto_retriever function in BeyondLLM allows seamless integration with vector databases, streamlining the retrieval process.
Considerations for Vector DB in auto_retriever:
Data and Vector Database Interaction:
Optional Data: If a vectordb instance is provided, the data argument becomes optional. This means you can retrieve information directly from the existing data within the vector database without providing additional data.
Data Integration: If both data and vectordb are provided, the new data will be added to the existing data in the vector database, creating a combined dataset for retrieval.
Requirement for Hybrid Retriever: When using the hybrid retriever type, the data argument is mandatory. This is because the hybrid retriever employs a keyword-based search component, which requires access to the raw text data.
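The three rules above can be summarised as a small decision function. This is only a sketch of the documented behaviour, not BeyondLLM's code: the helper name check_retriever_inputs, the has_vectordb flag, and the literal type string "hybrid" are assumptions made for illustration.

```python
from typing import Optional, Sequence


def check_retriever_inputs(
    data: Optional[Sequence[str]],
    has_vectordb: bool,
    retriever_type: str = "normal",
) -> str:
    """Describe what would be searched, following the rules listed above.

    Hypothetical helper: it models the documented rules, not the library itself.
    """
    # Hybrid retrieval has a keyword component that needs the raw text.
    if retriever_type == "hybrid" and data is None:
        raise ValueError("hybrid retrieval needs `data` for keyword search")
    # With neither data nor a vector database there is nothing to search.
    if data is None and not has_vectordb:
        raise ValueError("provide `data`, a vector database, or both")
    if data is None:
        return "existing vectordb contents only"
    if has_vectordb:
        return "existing vectordb contents plus new data"
    return "new data only"


print(check_retriever_inputs(None, has_vectordb=True))
print(check_retriever_inputs(["some text"], has_vectordb=True))
```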
BeyondLLM provides several retriever types, each offering distinct approaches to information retrieval:
1. Normal Retriever
This is the most basic retriever, employing vector similarity search to find the top-k most similar documents to the user query based on their embeddings.
Parameters:
data: The dataset containing the text data (already processed and split into nodes).
embed_model: The embedding model used to generate embeddings for the data.
top_k: The number of top results to retrieve.
Code Example:
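As a minimal illustration of what top-k vector similarity search does, the sketch below ranks a few hand-written vectors by cosine similarity in plain Python. It is independent of BeyondLLM: the function names and the toy two-dimensional vectors are assumptions, and a real retriever would take its vectors from embed_model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k most similar vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the output of an embedding model.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k_similar([1.0, 0.1], doc_vecs, top_k=2)
print(hits)
```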
2. Flag Embedding Reranker Retriever
This retriever enhances the normal retrieval process by incorporating a "flag embedding" reranker. The reranker further refines the initial results by considering the relevance of each retrieved document to the specific query, potentially improving retrieval accuracy.
Installation
Parameters:
data: The dataset containing the text data (already processed and split into nodes).
embed_model: The embedding model used to generate embeddings for the data.
top_k: The number of top results to initially retrieve before reranking.
reranker: The name of the flag embedding reranker model. The default is BAAI/bge-reranker-large.
Code Example:
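The retrieve-then-rerank flow described above can be sketched in plain Python. This is an illustration of the general two-stage pattern, not BeyondLLM's implementation: the names rerank, toy_overlap_score, and top_n are hypothetical, and the word-overlap scorer is only a stand-in for a real reranker model such as BAAI/bge-reranker-large.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Re-score each initially retrieved candidate against the query and keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Pretend these came back from the initial top_k vector search.
candidates = [
    "vector databases store embeddings",
    "rerankers reorder retrieved passages",
    "a reranker scores each query and passage pair",
]
reranked = rerank("how does a reranker score a query", candidates, toy_overlap_score, top_n=2)
print(reranked[0])
```

Swapping toy_overlap_score for a call into a real reranker model leaves the surrounding logic unchanged, which is why the scorer is passed in as a function.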
3. Cross Encoder Reranker Retriever
Similar to the Flag Embedding Reranker, this retriever uses a cross-encoder model to rerank the initial retrieval results. A cross-encoder processes the query and each candidate document together as a single input, rather than comparing separately computed embeddings, which often leads to more accurate relevance assessments at a higher computational cost.
Installation
Parameters:
data: The dataset containing the text data (already processed and split into nodes).
embed_model: The embedding model used to generate embeddings for the data.
top_k: The number of top results to initially retrieve before reranking.
reranker: The name of the cross-encoder reranker model. The default is cross-encoder/ms-marco-MiniLM-L-2-v2.
Code Example:
4. Hybrid Retriever
This retriever combines the strengths of both vector similarity search and keyword-based search. It retrieves documents that are both semantically similar to the query and contain relevant keywords, potentially providing more comprehensive results.
Parameters:
data: The dataset containing the text data (already processed and split into nodes).
embed_model: The embedding model used to generate embeddings for the data.
top_k: The number of top results to retrieve for each search method (vector and keyword) before the AND/OR combination is applied.
mode: Determines how results are combined. Options are AND (intersection of results) or OR (union of results). The default is AND.
NOTE: With mode="OR", top_k nodes are retrieved. With mode="AND", the number of nodes retrieved is less than or equal to top_k.
Code Example:
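The difference between the two modes can be seen in a small plain-Python sketch that combines two ranked lists of document IDs. It is not BeyondLLM's implementation: the function combine_hits is hypothetical, and both the interleaved ordering and the truncation of the OR result to top_k are assumptions chosen to agree with the note above.

```python
from typing import List


def combine_hits(
    vector_hits: List[str],
    keyword_hits: List[str],
    mode: str = "AND",
    top_k: int = 3,
) -> List[str]:
    """Combine vector-search and keyword-search results by intersection or union."""
    if mode == "AND":
        # Keep only documents found by both methods, in vector-search order.
        keyword_set = set(keyword_hits)
        return [doc for doc in vector_hits if doc in keyword_set][:top_k]
    if mode == "OR":
        # Interleave the two rankings, skip duplicates, then cut to top_k (assumption).
        merged: List[str] = []
        for i in range(max(len(vector_hits), len(keyword_hits))):
            for hits in (vector_hits, keyword_hits):
                if i < len(hits) and hits[i] not in merged:
                    merged.append(hits[i])
        return merged[:top_k]
    raise ValueError("mode must be 'AND' or 'OR'")


vector_hits = ["d1", "d2", "d3"]   # top_k results from vector search
keyword_hits = ["d3", "d4", "d1"]  # top_k results from keyword search
and_result = combine_hits(vector_hits, keyword_hits, mode="AND")
or_result = combine_hits(vector_hits, keyword_hits, mode="OR")
print(and_result, or_result)
```

With these inputs the AND result holds only the documents both searches agree on, so it can be shorter than top_k, while the OR result is filled up to top_k from either list.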
Choosing the Right Retriever
The choice of retriever depends on your specific needs and the nature of your data:
Normal Retriever: Suitable for straightforward retrieval tasks where basic semantic similarity is sufficient.
Reranker Retrievers: Useful when higher accuracy is required and computational resources allow for reranking.
Hybrid Retriever: Beneficial when dealing with diverse queries or when keyword relevance is important alongside semantic similarity.