Chunking

One of the most important benefits of RAG over other ways of feeding documents to an LLM (most notably uploading an entire document and placing it in the prompt) is that RAG retrieves only those parts of your document(s) that are relevant to answering the question. This is the needle-in-a-haystack problem that RAG is designed to tackle, an art in its own right, and one that other AI methods have a much harder time with even as context windows keep growing. RAG does this by chopping all your documents up into chunks: simply put, small snippets of your document.

There are many ways to chunk your documents, and you will often find that your specific use case calls for a very specific way of handling them. That said, RAG Me Up does try to standardize chunking with some sane defaults to start from, three in particular.

RecursiveCharacterTextSplitter

The RecursiveCharacterTextSplitter is a basic splitter from Langchain. It tries a list of separators in order, splitting on the first one that yields small enough pieces and recursing with the next separator on any piece that is still too large. RAG Me Up uses this Langchain splitter without modification and, crucially, passes the separators in order, following the defaults Langchain already supplies. This means we try to split on paragraphs first, then newlines, then punctuation, and so on.

return RecursiveCharacterTextSplitter(
    chunk_size=int(os.getenv("recursive_splitter_chunk_size")),
    chunk_overlap=int(os.getenv("recursive_splitter_chunk_overlap")),
    length_function=len,
    keep_separator=True,
    separators=[
        "\n \n",
        "\n\n",
        "\n",
        ".",
        "!",
        "?",
        " ",
        ",",
        "\u200b",  # zero-width space
        "\uff0c",  # fullwidth comma
        "\u3001",  # ideographic comma
        "\uff0e",  # fullwidth full stop
        "\u3002",  # ideographic full stop
        "",
    ],
)
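To see the in-order behaviour in isolation, here is a minimal standalone sketch. The chunk size, overlap, sample text, and the shortened separator list are made up for illustration; RAG Me Up reads the real values from the .env file.

import os

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical values; RAG Me Up takes these from the .env file.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    keep_separator=True,
    separators=["\n\n", "\n", ".", "!", "?", " ", ""],  # tried in this order
)

text = (
    "First paragraph about chunking strategies.\n\n"
    "Second paragraph that continues the discussion. "
    "It has multiple sentences! Does it split on punctuation? Only if needed."
)

# Paragraph breaks are tried first, then newlines, then punctuation.
for chunk in splitter.split_text(text):
    print(repr(chunk))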

SemanticChunker

The SemanticChunker also comes from Langchain. It embeds every sentence using the given vectorization model and then merges consecutive sentences into a chunk as long as they are similar enough. You can specify the threshold for this yourself; RAG Me Up propagates it from the .env file.

return SemanticChunker(
    self.embeddings,
    breakpoint_threshold_type=os.getenv("semantic_chunker_breakpoint_threshold_type"),
    # These parameters expect numbers, so the env strings are cast explicitly.
    breakpoint_threshold_amount=float(os.getenv("semantic_chunker_breakpoint_threshold_amount")),
    number_of_chunks=int(os.getenv("semantic_chunker_number_of_chunks")),
)
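For a feel of what this chunker does, here is a minimal self-contained sketch. The embedding model named below is a hypothetical stand-in for whatever your .env configures; the threshold values are illustrative. Valid breakpoint_threshold_type values in Langchain are percentile, standard_deviation, interquartile, and gradient.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Hypothetical model choice; RAG Me Up uses the embeddings configured in .env.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95.0,
)

text = (
    "Cats are small domesticated felines. They purr and chase mice. "
    "The stock market closed higher today. Tech shares led the gains."
)

# Sentences about cats and sentences about markets should land in separate chunks.
for chunk in chunker.split_text(text):
    print(repr(chunk))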

ParagraphChunker

Sometimes you just want to respect paragraph boundaries exactly. This is hard to achieve with Langchain's RecursiveCharacterTextSplitter, since it does not split on paragraph boundaries alone. The ParagraphChunker is specific to RAG Me Up: it fits as many whole paragraphs as possible into the chunk size without breaking any of them. As soon as a paragraph would need to be split to fit in the (remainder of a) chunk, it is placed in a new chunk instead.

return ParagraphChunker(
    max_chunk_size=int(os.getenv("paragraph_chunker_max_chunk_size")),
    paragraph_separator=os.getenv("paragraph_chunker_paragraph_separator"),
)
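Both parameters come from the .env file. A hypothetical configuration might look like this (the values are illustrative, not RAG Me Up's shipped defaults):

paragraph_chunker_max_chunk_size=1024
paragraph_chunker_paragraph_separator=\n\s*\n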

Implementation details

As mentioned, the ParagraphChunker is specific to RAG Me Up. It uses a regex (\n\s*\n by default) to find texts separated by markup such as blank lines (or some other pattern if you override the default paragraph_separator). Then, as long as paragraphs can be merged together without exceeding max_chunk_size characters, they are put together into a chunk. However, a paragraph's boundaries are never violated: a single paragraph larger than max_chunk_size is not split but kept intact, even though it exceeds the size.

Merged paragraphs are joined together with two newlines and the whole list of chunks is returned.
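The merging logic can be sketched as follows. This is a minimal re-implementation for illustration, not RAG Me Up's actual source; only the default regex, the greedy merging, and the two-newline join are taken from the description above.

import re

def chunk_paragraphs(text: str, max_chunk_size: int,
                     paragraph_separator: str = r"\n\s*\n") -> list[str]:
    # Split the document into paragraphs on the separator regex.
    paragraphs = [p.strip() for p in re.split(paragraph_separator, text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for paragraph in paragraphs:
        # +2 accounts for the two-newline join between merged paragraphs.
        added = len(paragraph) if not current else len(paragraph) + 2
        if current and current_len + added > max_chunk_size:
            # The paragraph no longer fits: close the current chunk, start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(paragraph)
        # A paragraph larger than max_chunk_size ends up alone in its own chunk, intact.
        current.append(paragraph)
        current_len += added

    if current:
        chunks.append("\n\n".join(current))
    return chunks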