Data ingestion and indexing
RAG Me Up only supports hybrid retrieval at this time: it stores and searches both dense vectors (semantic embeddings from models like BERT) and sparse vectors (keyword-based scoring like BM25), because in our experience combining semantic and keyword-based search is the sweet spot for almost all RAG applications.
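The two result lists eventually have to be merged into one ranking. The exact fusion strategy lives in PostgresHybridRetriever, but a common approach, reciprocal rank fusion, can be sketched as follows (the function and the constant k=60 are illustrative, not necessarily what RAG Me Up implements):

```python
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked lists of chunk ids into a single ranking.

    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is the
    value commonly used in the literature. This is an illustrative fusion
    strategy, not necessarily the one PostgresHybridRetriever uses.
    """
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# A chunk found by both retrievers outranks chunks found by only one
fused = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"])
```

Because chunk "b" appears in both rankings, it accumulates two reciprocal-rank contributions and ends up first.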
All interfacing with the Postgres hybrid retrieval database happens in the PostgresHybridRetriever.
Database setup
To make sure both the semantic retrieval queries and the BM25 queries perform well, we make use of pgvector, which efficiently indexes dense vectors (embeddings), and pg_search, which efficiently creates sparse BM25 indexes. Both are Postgres extensions that ship with ParadeDB's Postgres image.
Before doing anything else, though, we must ensure that the tables holding our data, together with their indexes, are created before we ingest any data.
Dense setup
For the dense vectors, we create an embedding table like so:
CREATE TABLE IF NOT EXISTS ragmeup_dense_embeddings (
id VARCHAR(32) PRIMARY KEY,
embedding public.vector({embedding_dimension}) NOT NULL,
content varchar NOT NULL,
metadata jsonb NOT NULL
);
The crucial part of this table is the embedding column, which is of type vector. This is where our embeddings are stored; its dimension depends on the embedding model in use, which is why we pass the embedding dimension in as a parameter. The content column contains the actual text of the chunk (important: we store chunks, not whole documents). The metadata column holds a JSON value with additional information about the chunk, for example the source filename or anything else you want attached that isn't part of the content or vector itself (note: RAG Me Up does NOT offer a non-code way to extend metadata, you'll have to do this yourself).
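The id column is a VARCHAR(32), which happens to fit an MD5 hex digest of the chunk content exactly; deriving the id from the content also makes re-ingestion idempotent. A hypothetical sketch of preparing a row (RAG Me Up may derive its ids differently, and the metadata keys shown are examples, not an enforced schema):

```python
import hashlib
import json

def prepare_dense_row(content: str, source: str) -> tuple:
    """Build an (id, content, metadata) tuple for the dense embeddings table.

    The MD5 hex digest is exactly 32 characters, matching the VARCHAR(32)
    primary key. The metadata keys here are illustrative only.
    """
    chunk_id = hashlib.md5(content.encode("utf-8")).hexdigest()
    metadata = json.dumps({"source": source, "dataset": "default"})
    return chunk_id, content, metadata

row = prepare_dense_row("RAG Me Up stores chunks, not documents.", "readme.md")
```

The embedding itself would be computed by the configured embedding model and inserted alongside these three values.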
Next we need to create the vector index:
CREATE INDEX IF NOT EXISTS ragmeup_dense_embedding_index ON ragmeup_dense_embeddings USING hnsw (embedding vector_cosine_ops) WITH (m='16', ef_construction='64');
HNSW is an efficient approximate nearest neighbour indexing algorithm. Its m and ef_construction parameters can be tweaked, but the defaults of 16 and 64 are reasonable starting points.
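The vector_cosine_ops operator class orders neighbours by cosine distance, i.e. one minus the cosine similarity of the two vectors. A quick sketch of the metric the HNSW index is approximating:

```python
import math

def cosine_distance(a, b):
    """Cosine distance as used by pgvector's vector_cosine_ops: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Vectors pointing the same way have distance 0, orthogonal vectors distance 1
d_same = cosine_distance([1.0, 2.0], [2.0, 4.0])
d_orth = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because cosine distance ignores vector magnitude, it matches how embedding models encode meaning in direction rather than length.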
Sparse setup
For sparse retrieval, the table structure is similar but since we don't have an embedding, we just store the actual text and metadata.
CREATE TABLE IF NOT EXISTS ragmeup_sparse_embeddings (
id VARCHAR(32) PRIMARY KEY,
content TEXT,
metadata JSONB
);
A couple of things are worth noting here:
- We could have used the dense table for sparse retrieval too, by simply adding an index to the dense table's content column. The problem with this, however, is that we could then no longer cleanly measure whether a result came from dense or from sparse retrieval, nor weight the two differently, should we choose to do so.
- Both tables contain metadata. This is redundant, but required because we may retrieve a chunk via either dense or sparse retrieval and we want the full metadata regardless.
Once this table is created, we also need to create an index, but a simple CREATE INDEX IF NOT EXISTS has proven to be dodgy with pg_search, so instead we check for existence ourselves.
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = lower('ragmeup_sparse_embeddings_bm25')
AND n.nspname = 'public'
AND c.relkind = 'i'
) THEN
CREATE INDEX ragmeup_sparse_embeddings_bm25 ON ragmeup_sparse_embeddings USING bm25 (id, content) WITH (key_field='id');
CREATE INDEX idx_metadata_dense_dataset ON ragmeup_dense_embeddings (((metadata::jsonb ->> 'dataset')::text));
CREATE INDEX idx_metadata_sparse_dataset ON ragmeup_sparse_embeddings (((metadata::jsonb->>'dataset')::text));
END IF;
END $$;
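With the BM25 index in place, sparse retrieval boils down to a pg_search match query. The sketch below assembles one; it relies on pg_search's @@@ match operator and paradedb.score() ranking function (check the ParadeDB documentation for your version), and is an illustration rather than RAG Me Up's actual retrieval SQL:

```python
def build_sparse_query(limit: int = 5) -> str:
    """Assemble a BM25 query against the sparse table defined above.

    The %s placeholder receives the user's search terms; @@@ is pg_search's
    full-text match operator and paradedb.score(id) returns the BM25 score
    (per the ParadeDB documentation, double-check for your version).
    """
    return (
        "SELECT id, content, metadata, paradedb.score(id) AS score "
        "FROM ragmeup_sparse_embeddings "
        "WHERE content @@@ %s "
        f"ORDER BY score DESC LIMIT {limit}"
    )

sql = build_sparse_query(10)
```

The placeholder would be bound through your Postgres driver (e.g. psycopg) rather than interpolated, to avoid SQL injection.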
Note that we also create indexes on the dataset metadata field of both the dense and the sparse tables (plain indexes on the extracted JSON value, not BM25 indexes). While these aren't actively used by RAG Me Up during retrieval, they pave the way for filtering or searching on metadata next to content, should you need this for your use case.
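With the dataset indexes in place, a metadata-filtered dense query could look like the sketch below. Table and column names follow the DDL above; the query itself is an illustration, not RAG Me Up's actual retrieval SQL:

```python
def build_dense_query(limit: int = 5) -> str:
    """Assemble a dense retrieval query filtered on the 'dataset' metadata key.

    The %s placeholders receive the dataset name and the query embedding;
    <=> is pgvector's cosine distance operator, matching the HNSW index
    created with vector_cosine_ops above.
    """
    return (
        "SELECT id, content, metadata "
        "FROM ragmeup_dense_embeddings "
        "WHERE (metadata::jsonb ->> 'dataset')::text = %s "
        f"ORDER BY embedding <=> %s LIMIT {limit}"
    )

sql = build_dense_query()
```

The WHERE clause uses the same expression as the idx_metadata_dense_dataset index, so the planner can use that index for the filter.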