🌐Source

What is Source?

Source refers to the origin of the RAG pipeline: data. The first step in building a RAG pipeline is to source data from diverse origins and transform it into a usable form. We follow a two-step process: loading the data, followed by splitting/chunking it.

The beyondllm.source module provides a variety of loaders to ingest and process data from different sources, making it easy to integrate your data into the RAG pipeline for question answering and information retrieval. Every loader returns a list of TextNode objects.

fit Function

The central function for loading data is fit. It offers a unified interface for loading and processing data regardless of the source type. Whether you have local files, web pages, YouTube videos, or want to leverage the power of LlamaParse, fit simplifies the process, handling multiple input types with ease.

Centralized Data Loading:

from beyondllm.source import fit

data = fit(path="<your-doc-path-here>", dtype="<your-dtype>", chunk_size=512, chunk_overlap=100)

Parameters:

  • path (str or list): The path to your data source(s).

    • For single inputs: A string representing a local file path, a URL, or a YouTube video URL.

    • For multiple inputs: A list of strings, where each string is a file path, URL, or YouTube video URL.

  • dtype (str): Specifies the type of loader to use, based on your data format. Supported options include:

    • File Types: "pdf", "csv", "docx", "epub", "md", "ppt", "pptx", "pptm", "txt" (using SimpleLoader)

    • Web Pages: "url" (using UrlLoader)

    • YouTube Videos: "youtube" (using YoutubeLoader)

    • LlamaParse Cloud API: "llama-parse" (using LlamaParseLoader)

  • chunk_size (int): The desired length (in characters) for splitting text into chunks. Defaults to 512.

  • chunk_overlap (int): The number of overlapping characters between consecutive chunks to preserve context. Defaults to 100.

Returns:

  • List[TextNode]: A list of TextNode objects representing your processed data, ready for use in the RAG pipeline.
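
For a quick sanity check, you can inspect the returned nodes. The snippet below is a minimal sketch: it assumes the nodes follow LlamaIndex's TextNode interface (a text attribute and a metadata dictionary), and the file path is a placeholder.

from beyondllm.source import fit

data = fit(path="path/to/document.pdf", dtype="pdf", chunk_size=512, chunk_overlap=100)

print(len(data))           # number of chunks produced
print(data[0].text[:200])  # first 200 characters of the first chunk
print(data[0].metadata)    # source metadata attached by the loader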

Available Loaders:

BeyondLLM provides a range of specialized loaders to handle different data types. All loaders are accessible through the fit function by simply changing the dtype parameter.

1. SimpleLoader:

Handles common file types like PDFs, Word documents, presentations, and Markdown files, and supports the chunk_size and chunk_overlap parameters for text splitting.

  • File Types Supported: "pdf", "csv", "docx", "epub", "md", "ppt", "pptx", "pptm", "txt"

  • Multiple Inputs: Accepts a list of file paths.

  • Directory Loading: When providing a directory path as path, the SimpleLoader will automatically load all supported file types within that directory.

Code Snippet (Loading Multiple PDFs):

from beyondllm.source import fit

pdf_paths = ["path/to/document1.pdf", "path/to/document2.pdf", "path/to/document3.pdf"]
data = fit(path=pdf_paths, dtype="pdf")

To load a single file, simply pass its path:

from beyondllm.source import fit

data = fit(path="path/to/document1.pdf", dtype="pdf")

Some file types require additional libraries. To parse .docx files, install:

pip install docx2txt

For .ppt files:

pip install torch transformers python-pptx Pillow

To load multiple documents, you can also pass a path to a directory:

from beyondllm.source import fit

data = fit(path="path/to/directory/", dtype="pdf")
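
The chunking parameters can be tuned per call if the defaults do not suit your documents. The values below are purely illustrative: a larger chunk_size keeps more context in each node, while a larger chunk_overlap preserves continuity across chunk boundaries.

from beyondllm.source import fit

# illustrative values; the defaults are chunk_size=512 and chunk_overlap=100
data = fit(path="path/to/directory/", dtype="pdf", chunk_size=1024, chunk_overlap=150)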

2. UrlLoader:

Extracts text content from web pages, given a URL or a list of URLs. Supports the chunk_size and chunk_overlap parameters. This loader requires an additional library, which can be installed by running:

pip install llama-index-readers-web

  • Multiple Inputs: Accepts a list of URLs.

Code Snippet (Loading Multiple Web Pages):

from beyondllm.source import fit

urls = ["https://www.example.com", "https://www.anotherwebsite.org"]
data = fit(path=urls, dtype="url")
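
A single page can also be loaded by passing the URL as a plain string (the address below is a placeholder):

from beyondllm.source import fit

data = fit(path="https://www.example.com", dtype="url")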

3. YoutubeLoader:

Downloads and processes transcripts from YouTube videos. This loader requires the following additional libraries:

pip install llama-index-readers-web

  • Multiple Inputs: Accepts a list of YouTube video URLs.

Code Snippet (Loading Transcripts from Multiple Videos):

from beyondllm.source import fit

youtube_urls = ["https://www.youtube.com/watch?v=video_id_1", "https://www.youtube.com/watch?v=video_id_2"]
data = fit(path=youtube_urls, dtype="youtube")

4. LlamaParseLoader:

Leverages the LlamaParse Cloud API to extract structured information and text from various documents. Requires the llama_parse_key parameter or the LLAMA_CLOUD_API_KEY environment variable to be set. Supports the chunk_size and chunk_overlap parameters; internally, the parsed output is split using a markdown splitter. Install the required library by running:

pip install llama-parse

  • File Types Supported (Currently): "pdf"

  • Requires: llama_parse_key parameter or LLAMA_CLOUD_API_KEY environment variable (obtain an API key from https://cloud.llamaindex.ai/login).

  • Multiple Inputs: Accepts a list of PDF file paths.

Code Snippet (Loading Multiple PDFs with LlamaParse):

from beyondllm.source import fit

pdf_paths = ["path/to/document1.pdf", "path/to/document2.pdf"]
data = fit(path=pdf_paths, dtype="llama-parse", llama_parse_key="your_llama_parse_api_key")
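
Alternatively, if you prefer not to pass the key as an argument, you can set the LLAMA_CLOUD_API_KEY environment variable before calling fit (the key value below is a placeholder):

import os

os.environ["LLAMA_CLOUD_API_KEY"] = "your_llama_parse_api_key"
data = fit(path=pdf_paths, dtype="llama-parse")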

  • For LlamaParseLoader in Google Colab (or any environment with an already-running event loop), run these lines before using fit so that LlamaParse's async calls can be nested:

    import nest_asyncio
    nest_asyncio.apply()

By leveraging the diverse range of loaders in BeyondLLM, you can effectively incorporate information from various sources into your RAG pipeline for enhanced question answering and knowledge retrieval capabilities.
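
As a pointer to the next step, the nodes returned by fit are typically handed straight to a retriever. The snippet below is a minimal sketch based on the retrieval API described elsewhere in the BeyondLLM docs (retrieve.auto_retriever with an embeddings model); treat the exact class names and arguments as assumptions and adapt them to the models you use.

from beyondllm import retrieve, embeddings
from beyondllm.source import fit

data = fit(path="path/to/document.pdf", dtype="pdf")

# assumed embedding model; swap in whichever beyondllm.embeddings model you have configured
embed_model = embeddings.GeminiEmbeddings()
retriever = retrieve.auto_retriever(data, embed_model, type="normal", top_k=4)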
