Embeddings
One of the most crucial steps in any AI application is converting whatever information we have at hand (e.g. text, images, video, etc.) into a numerical representation, so that we can perform mathematical operations on that information - similarity calculation being the most notable operation for RAG purposes. We do this by applying trained models to our data that convert the information into embeddings. An embedding is a vector containing N numbers (N is also called the dimension) that represents a piece of information. In the case of RAG Me Up, embedding models turn the textual input from our chunking step into vector representations.
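To make this concrete, here is a minimal sketch of encoding text into dense vectors and comparing them with SentenceTransformers. The model name is just an example and not necessarily what your RAG Me Up configuration uses:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Example model; all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How do I reset my password?",
    "Steps to recover a forgotten password",
]
embeddings = model.encode(texts)

print(embeddings.shape)                       # (2, 384) -> N = 384
print(cos_sim(embeddings[0], embeddings[1]))  # high similarity: both sentences are about passwords
```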
If we just look at naive, plain RAG, using those vectors is sufficient to get a working pipeline. When constructed with an LM or LLM [1], these vectors are called dense vectors and aim to preserve semantics and/or syntax [2]. Since we are discussing the application of RAG, we should be looking for a model that is optimized for retrieval. Luckily for us, there are different leaderboards where we can find models scored against benchmarks for a plethora of tasks, including retrieval. One such leaderboard is the MTEB leaderboard at Huggingface. The choice of embedding model is usually a trade-off between the reported retrieval accuracy and the size of the model in terms of embedding dimension or max tokens. The latter is important because the embedding step, while crucial for indexing, is a small gear in the full chain of applying RAG, and any overly heavy burden on memory or throughput should be weighed carefully.
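As a quick sanity check of those size trade-offs, you can inspect a candidate model's embedding dimension and maximum input length directly through SentenceTransformers; the model name below is merely an illustrative pick from the leaderboard:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # example candidate model

print(model.get_sentence_embedding_dimension())  # embedding dimension, e.g. 384
print(model.max_seq_length)                      # max tokens taken into account per input
```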
As an alternative to using open-source (Huggingface) embedding models, you could also use one from a cloud provider like OpenAI, Google Gemini, Anthropic, etc. Currently, however, RAG Me Up does not natively support this, so you would need to modify the code to make that work.
Hybrid embeddings
While creating dense vectors is the basic go-to step of any (naive) RAG setup, it has its limitations. Most notably, embedding models work really well at capturing the context of a given input sentence, but users typically provide queries in the form of just keywords, potentially in nonsensical order or grammatically ill-formed [3]. Luckily, there is a rich history in natural language processing from before the current AI era, and keyword-based search is a well-developed field. One very commonly used technique in retrieval (and hence indexing) is hybrid search, where we combine the dense vectors obtained from embedding models with sparse vectors that capture whether specific (key)words are present or not, optimizing for keyword-based search.
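The snippet below is a purely illustrative sketch of the idea behind hybrid search, not RAG Me Up's implementation (which lives in Postgres, as described later on): a dense cosine similarity score and a sparse BM25 keyword score are fused with a simple weighted sum. The rank_bm25 package and the 0.5/0.5 weighting are assumptions made for this example only:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

docs = [
    "Postgres stores the chunk embeddings",
    "Hybrid search combines dense and sparse retrieval",
]
query = "sparse keyword search"

# Sparse side: BM25 over naive whitespace tokens
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense side: cosine similarity between the query and document embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = cos_sim(model.encode(query), model.encode(docs))[0]

# Naive fusion; real systems normalize the scores before weighting them
hybrid = [0.5 * d + 0.5 * s for d, s in zip(dense_scores.tolist(), sparse_scores)]
print(sorted(zip(hybrid, docs), reverse=True))
```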
RAG Me Up always applies hybrid search; there is currently no way to use only dense or only sparse search without modifying the code.
Creating the embeddings
In the previous step, we created a number of chunks from our documents. Each of these chunks needs to be embedded using the embedding model of choice and stored in our Postgres database. In the `load_data` function of the `RAGHelper` class (in `RAGHelper.py`), you will see that the data is first pre-processed into a standard format with some metadata:
```python
chunks = [{
    # Unique ID: an MD5 hash of the chunk's content
    "id": hashlib.md5(chunk.encode()).hexdigest(),
    # Dense vector produced by the embedding model
    "embedding": self.embeddings.encode(chunk),
    "content": chunk,
    # Metadata stored as a JSON string
    "metadata": json.dumps({
        "source": file,
        "dataset": subfolder
    })
} for chunk in chunks]
```
What we do here is create a unique ID based on a hash of the actual chunk's content. One implication of this approach is that duplicate chunks with the same content will not occur twice in our database. This is a conscious choice that may or may not be useful for your application, but currently this is how RAG Me Up enforces unique content.
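As a tiny illustration of why this deduplicates: hashing is deterministic, so two chunks with identical content produce the exact same ID and therefore map to the same database record:

```python
import hashlib

chunk_a = "RAG Me Up embeds every chunk before storing it."
chunk_b = "RAG Me Up embeds every chunk before storing it."

# Identical content -> identical MD5 hash -> identical ID in the database
print(hashlib.md5(chunk_a.encode()).hexdigest())
print(hashlib.md5(chunk_b.encode()).hexdigest())
```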
A very important step in the snippet above is where we actually create the dense embeddings by calling `self.embeddings.encode(chunk)`. This call invokes a Huggingface embedding model to produce the embedding. The initialization of this model is done in the `initialize_embeddings` function, where we simply load the model specified in the `.env` file through Huggingface's SentenceTransformers library.
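Conceptually, that initialization boils down to something like the sketch below. The exact environment variable name is an assumption here; check the `.env` template shipped with RAG Me Up for the actual key:

```python
import os

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer

load_dotenv()

# "embedding_model" is a placeholder name for the .env key holding the model identifier
model_name = os.getenv("embedding_model", "all-MiniLM-L6-v2")
embeddings = SentenceTransformer(model_name)

# Later used as: embeddings.encode(chunk)
```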
The magic line of code that writes these chunks out to the hybrid database is this command: `self.retriever.add_documents(documents)`. The `RAGHelper` class holds an instance of the `PostgresHybridRetriever` class, which is responsible for all database management and related operations. In the next section, we will explore the inner workings of this class in more depth.
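To give an impression of what such a write could involve (this is an illustrative sketch, not the actual `PostgresHybridRetriever` code, and the table and column names are assumptions), the chunks end up in Postgres with a pgvector column for dense search and a tsvector column for sparse keyword search:

```python
import psycopg2
from pgvector.psycopg2 import register_vector

def add_documents(conn, documents):
    """Sketch: insert pre-embedded chunks into a hypothetical 'chunks' table."""
    register_vector(conn)  # lets psycopg2 send numpy embeddings as pgvector values
    with conn, conn.cursor() as cur:
        for doc in documents:
            cur.execute(
                """
                INSERT INTO chunks (id, embedding, content, metadata, content_tsv)
                VALUES (%s, %s, %s, %s, to_tsvector('english', %s))
                ON CONFLICT (id) DO NOTHING
                """,
                (doc["id"], doc["embedding"], doc["content"], doc["metadata"], doc["content"]),
            )
```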
Footnotes
1. An LM is a Language Model and an LLM is a Large Language Model. While these terms are fluid and often used loosely, in general an LM models language only, performing tasks like Masked Language Modeling and Next Sentence Prediction (the BERT model is the most famous example of this), and does not generate new text. LLMs can be used for a variety of tasks and, confusingly enough, also vectorize just like an LM does, but these models are most often used in a generative setting to spit out new text. ↩
2. While embedding models often aim at preserving (to some extent) both semantics and syntax, some really excel at one or the other. It is wise to pay attention to what the goal of a given pre-trained model is and to check whether it aligns with what we are going to use it for. ↩
3. An interesting, often overlooked discussion is that we, as humans, are not yet adept enough at using AI to actually excel at posing our questions well. We come from a history of keyword search through platforms like Google, but for AI - and RAG in particular - this is not the best vehicle. Educating users of a RAG system on how to properly ask questions (a prompt-engineering-light) is just as crucial a part of any RAG implementation as the technology itself. ↩