Document Loaders
One of the first steps of any RAG framework or pipeline is to extract information (usually text, especially in vanilla RAG) from the documents we want to be able to ask questions about. There are different ways of doing this, but a common misconception is that this step already heavily involves AI: most document loaders are plain (and sometimes very old) Python libraries that can parse a specific file format.1 For example, extracting text from a PDF file is something we have been doing for decades, well before AI arrived in its current form.
Apart from their inner workings, document loaders also serve different purposes depending on the type of RAG you are performing. A non-exhaustive list:
- For vanilla (Q&A on text) RAG, we usually just want to extract the textual information from files. To give a few examples: suppose we have JSON data with numerical fields and arrays; while relevant, most of those numbers and arrays (unless textual) do not lend themselves well to vanilla RAG. We may still choose to extract that information, but bear in mind that RAG relies on embedding models, and to such a model the number 3 and the number 8 are nearly identical. Similarly, if we have a PowerPoint with a lot of visuals, we cannot really use those in any sensible way (unless we use a vision transformer of sorts, see 1), so we usually disregard them and extract only the text that is there.
- A common use-case is to enrich your BI pipeline or otherwise use the power of AI on top of your structured data.2 Because (vanilla) RAG uses embeddings and performs natural-language Q&A (and a spreadsheet is not natural language), this is a mismatch in paradigm and problem space. To overcome this, there is a breed of RAG3 that uses Text2SQL: we follow the same principles as a vanilla RAG pipeline, but our data is now stored in tables, so we need to make sure that 1. the user's query is first transformed into a SQL query by an LLM and 2. the results of that query are used to formulate the answer. In other words, we add an invisible text-to-SQL and SQL-to-text layer in between (see the sketch right after this list).
- GraphRAG is a flavour of RAG that constructs a graph database instead of an embedding or SQL database. This works really well if you anticipate user queries that reason about relations, potentially multi-hop. The weakness of vanilla RAG is that documents are chunked and, as a consequence, long-range dependencies are not captured in those chunks; this is where GraphRAG excels.4 Using GraphRAG, however, does have implications for your document loading and chunking, because we are no longer just looking for pieces of text - we actually need to transform that text into a graph representation. This is usually done by applying a form of entity (and relation) extraction right after the loader has extracted all text.
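To make the Text2SQL flow from the second bullet a bit more concrete, here is a minimal sketch of the two invisible layers. The llm client, its complete() method and the prompts are hypothetical placeholders, not part of RAG Me Up:

import sqlite3

def answer_with_text2sql(llm, db_path: str, user_query: str) -> str:
    # 1. Have the LLM translate the natural-language question into a SQL query.
    #    `llm.complete` is a stand-in for whatever LLM client you use.
    sql = llm.complete(
        "Translate this question into a single SQL query for our schema:\n" + user_query
    )

    # 2. Run the generated SQL against the structured data.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()

    # 3. Have the LLM turn the raw rows back into a natural-language answer.
    return llm.complete(
        f"Question: {user_query}\nSQL results: {rows}\nAnswer the question using these results."
    )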
While RAG Me Up can be adapted to also perform Text2SQL and GraphRAG tasks5, in its open-source form it only performs vanilla RAG on textual information.
Different Loaders
Now that we know what document loaders are and what goal they serve, we need to decide which file types to support and how to process them. The load_data function inside RAGHelper.py shows which file types are currently supported by RAG Me Up and how they are processed. Let's have a look at each file type and how it is handled.
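As a rough mental model of what load_data does, the dispatch looks something like the sketch below. The helper names (load_json, load_text, load_csv, load_pptx, load_with_docling) are made up for illustration; the real code in RAGHelper.py is more elaborate:

import os

def load_document(file):
    # Illustrative only: pick a loader based on the file extension.
    extension = os.path.splitext(file)[1].lower()
    if extension == ".json":
        return load_json(file)        # JSON, optionally filtered through jq
    if extension in (".txt", ".xml"):
        return load_text(file)        # TXT and XML, read in as-is
    if extension == ".csv":
        return load_csv(file)         # CSV, converted to JSON via pandas
    if extension == ".pptx":
        return load_pptx(file)        # PowerPoint, via python-pptx
    return load_with_docling(file)    # PDF, DOCX, XLSX and the rest, via Docling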
JSON
As mentioned before, JSON data is usually a mix of structured/numerical and textual data. This makes it somewhat of a nuisance to process, since we only want the textual data. One functional choice is to treat the full JSON object as a (textual) document, but RAG Me Up also supports selecting parts of it through a jq schema:
with open(file, "r", encoding="utf-8") as f:
    doc = json.load(f)
doc = jq_compiled.input(doc).first()
doc = json.dumps(doc)
We expect every document to be valid JSON on its own (an object, array, etc.), so this is not jsonl, where each line holds a separate JSON element - keep this in mind. We read in the file and parse it as JSON. We then apply the jq json_schema that can be defined in the .env file to select only parts of the JSON. For example, {field1, field2} (jq shorthand for {field1: .field1, field2: .field2}) would select only the fields field1 and field2 from a JSON object and ignore all the other fields.
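Putting it together, a self-contained version of the above could look as follows; the exact variable names differ from RAGHelper.py and the json_schema value is just an example:

import json
import os

import jq

file = "example.json"  # any JSON document on disk

# e.g. json_schema={field1, field2} in .env; the default "." keeps the whole object
jq_compiled = jq.compile(os.getenv("json_schema", "."))

with open(file, "r", encoding="utf-8") as f:
    doc = json.load(f)
doc = jq_compiled.input(doc).first()  # keep only the selected parts
doc = json.dumps(doc)                 # back to a plain-text document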
TXT and XML
RAG Me Up used to have complex XML parsing, but it turned out to be hard to make foolproof, so right now XML and TXT are treated the same way. They are the most trivial document loaders in RAG Me Up: we just open the file and read its contents in full.
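In essence this boils down to something like the following (the actual code in load_data may differ slightly):

# Read the entire file as one big string; no XML structure is parsed.
with open(file, "r", encoding="utf-8") as f:
    doc = f.read()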
CSV
If you are working with CSV, you should probably ask yourself whether you want vanilla RAG (and hence RAG Me Up) at all, rather than a Text2SQL framework, but we do support loading it. CSVs are read in and converted into JSON so that we keep the column names together with every row's cell values. We do this using pandas:
import json
import os
import pandas as pd

# The separator is configurable through the csv_separator setting in .env
df = pd.read_csv(file, encoding="utf-8", sep=os.getenv("csv_separator"))
json_data = df.to_dict(orient='records')  # one dict per row, keyed by column name
doc = json.dumps(json_data)
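To see what this produces, here is a small, made-up two-row CSV run through the same calls (assuming ; as the csv_separator):

import json
from io import StringIO

import pandas as pd

csv_text = "name;age\nAlice;30\nBob;25"  # hypothetical example data

df = pd.read_csv(StringIO(csv_text), sep=";")
print(json.dumps(df.to_dict(orient="records")))
# [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]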
PowerPoint
PowerPoint presentations can be loaded in the form of pptx files (note: ppt is not supported). We use the python-pptx library to load and parse a PowerPoint file and then go over every slide to extract its text in full, which is then simply concatenated into one big document:
from pptx import Presentation

presentation = Presentation(file)
full_text = []
for slide in presentation.slides:
    slide_text = []
    for shape in slide.shapes:
        if shape.has_text_frame:
            for paragraph in shape.text_frame.paragraphs:
                slide_text.append(paragraph.text)
    full_text.append("\n".join(slide_text))
doc = "\n\n".join(full_text)
Note that the handling of text inside objects is quite limited. If you have really complex PowerPoints with a lot of text nested inside shapes or groups, you may want to write more elaborate extraction logic here.
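If you do need that, a possible starting point is to recurse into grouped shapes with python-pptx; the sketch below is not part of RAG Me Up and only covers text frames and groups:

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def extract_shape_text(shape):
    # Recursively collect text, descending into grouped shapes.
    texts = []
    if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
        for child in shape.shapes:
            texts.extend(extract_shape_text(child))
    elif shape.has_text_frame:
        for paragraph in shape.text_frame.paragraphs:
            texts.append(paragraph.text)
    return texts

presentation = Presentation(file)
full_text = []
for slide in presentation.slides:
    slide_text = []
    for shape in slide.shapes:
        slide_text.extend(extract_shape_text(shape))
    full_text.append("\n".join(slide_text))
doc = "\n\n".join(full_text)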
PDF, DOCX, XLSX and other file types
For all other file types - most notably PDF and DOCX (again, not doc) - RAG Me Up relies on Docling for text extraction. In the constructor of the RAGHelper class, we create an instance of Docling's DocumentConverter:
from docling.document_converter import DocumentConverter

self.converter = DocumentConverter()
Then we use this to extract and convert to text:
doc = self.converter.convert(file).document.export_to_text()
If your LLM or use-case benefits from this, you may want to replace the export_to_text() function with export_to_markdown() instead, to have some formatting or markup present in your data.
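As a self-contained illustration of that variant (outside of RAG Me Up's own code):

from docling.document_converter import DocumentConverter

file = "example.pdf"  # any file type Docling supports

converter = DocumentConverter()
# Same conversion, but headings, lists and tables are kept as markdown.
doc = converter.convert(file).document.export_to_markdown()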
Footnotes
1. There is quite some nuance here. If you have visual information or more advanced structures in your documents, you may choose to use models like vision transformers or other forms of AI just to extract information from them. This, however, is a rather advanced and uncommon way of loading documents. It also scales pretty poorly because, as of this writing, there are not many APIs or commercial providers of vision transformers available, with a notable exception being Google Gemini's image understanding capabilities.
2. Structured data is basically everything you can put in a database or an Excel sheet: rows, columns and cell values (usually numerical or, at best, brief texts). Unstructured information is the stuff we humans thrive on but that computers - except through AI - do not understand too well: text, images, speech, video.
3. Text2SQL is not always RAG and vice versa, but it becomes part of RAG when you use structured data.
4. Typical use-cases for GraphRAG are sets of documents with dependencies (cross-references) between them, or really large documents with internal references, such as legal or financial documents. There is no proper way of chunking them while capturing those relations. A growing counter-argument is that with ever-larger context windows we can just inject the whole document into the prompt, but LLMs typically get markedly worse at understanding paragraph-level information as documents grow larger. On top of that, the whole benefit of RAG is not just prompting with documents: there is a retrieval step up front, and that step becomes particularly challenging when you choose to inject documents in full.