What is a good way to store NLP-Embeddings (nparrays plus information) [closed]

I have huge text datasets (500,000+ documents) and I want to store embeddings for all sentences or paragraphs in each document. An embedding is a numpy array with 768 entries.

I know that one can easily write numpy arrays to disk, but I also need to store additional information for these embeddings, namely which sentence/paragraph they represent and the document in which that sentence occurs. I thought about storing all of this information in a (PostgreSQL) database, but I fear that searching for vectors/embeddings might be slow. The application is similarity search, that is, finding the vectors most similar to a query.
What is the best way of storing these vectors and their corresponding info? Is it efficient to store Python tuples, in this case (document_ID, sentence_as_string, sentence_embedding)? Would a PostgreSQL database do the job?
I have also thought about storing all embeddings as a numpy matrix in a .npy file and storing just the row number of each embedding in the database. This would mean loading all embeddings into memory, but I feel like this might be the best for performance. Is it 'messy'? Are there best practices for storing numpy arrays plus additional information?
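To make the ".npy file plus row number" idea concrete, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server; the file names, the table layout, and the helper names (save_units, load_embeddings, lookup_unit) are assumptions for illustration only.

```python
import sqlite3
import tempfile
from pathlib import Path

import numpy as np

DIM = 768  # embedding size mentioned in the question


def save_units(workdir, units, embeddings):
    """Write embeddings to a .npy file and (row_id, document_id, text) to a DB.

    units: list of (document_id, text) tuples, in the same order as the rows
    of embeddings, which is an (n, DIM) array.
    """
    workdir = Path(workdir)
    np.save(workdir / "embeddings.npy", np.asarray(embeddings, dtype=np.float32))
    con = sqlite3.connect(str(workdir / "units.db"))
    con.execute(
        "CREATE TABLE IF NOT EXISTS units "
        "(row_id INTEGER PRIMARY KEY, document_id TEXT, text TEXT)"
    )
    # The row_id is the row index in the matrix: this is the only link
    # between the two stores, so the order must never change.
    con.executemany(
        "INSERT INTO units VALUES (?, ?, ?)",
        [(i, doc_id, text) for i, (doc_id, text) in enumerate(units)],
    )
    con.commit()
    con.close()


def load_embeddings(workdir):
    """Memory-map the matrix so the OS pages rows in on demand."""
    return np.load(Path(workdir) / "embeddings.npy", mmap_mode="r")


def lookup_unit(workdir, row_id):
    """Return (document_id, text) for a matrix row."""
    con = sqlite3.connect(str(Path(workdir) / "units.db"))
    row = con.execute(
        "SELECT document_id, text FROM units WHERE row_id = ?", (row_id,)
    ).fetchone()
    con.close()
    return row


# Tiny demo with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
demo_units = [
    ("doc-1", "First sentence."),
    ("doc-1", "Second sentence."),
    ("doc-2", "Another one."),
]
demo_vectors = rng.standard_normal((3, DIM)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    save_units(tmp, demo_units, demo_vectors)
    loaded = np.array(load_embeddings(tmp))  # copy, so the temp file can be removed
    demo_unit = lookup_unit(tmp, 1)
```

As a rough size estimate, a float32 vector with 768 entries takes 768 × 4 = 3072 bytes, so one million units come to about 3 GB. Whether that fits in memory depends on how many units the corpus produces.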

Edit (Additional Info):
I have several datasets, like the Enron Corpus, which I want to split into sentences or paragraphs. Let's call them units. For each unit, I want to calculate a sentence embedding. These vectors have 768 dimensions. As I want to search for the most similar vectors, I need to calculate the cosine similarity between all vectors. I would also like to calculate the cosine similarity between all vectors and the embedding of a search query, which makes a comparison against every stored vector necessary.
Now my question is how to store this information effectively. The application seems to fit a classic relational database schema: a document consists of several units, and each unit has a text field. I suppose that one could also store a 768-dimensional vector as an entry in the database, so a unit could have its embedding stored alongside it. However, I fear that calculating the cosine similarity might be pretty slow inside the database, compared to having all embeddings in memory. But when I store all embeddings as a numpy array and load them into memory, I lose the info on which unit produced which embedding. So my question is how best to store this large number of 768-dimensional vectors and their corresponding information.
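For reference, the in-memory comparison described here can be sketched as a single matrix-vector product. This is a minimal sketch, assuming the embeddings are the rows of one numpy matrix and that the unit texts are kept in a parallel list (or a database table) in the same order, so that the row index identifies the unit. The function names and the toy 3-dimensional vectors are made up for illustration.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_similar(query, unit_matrix, k=5):
    """Return (row_indices, scores), best match first.

    unit_matrix must already be row-normalized (see normalize_rows).
    """
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = unit_matrix @ q                    # cosine similarity to every row
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # the k best rows, unordered
    top = top[np.argsort(-scores[top])]         # order those k by score
    return top, scores[top]


# Toy data: 3-dimensional vectors instead of 768, to keep the demo readable.
unit_texts = ["unit A", "unit B", "unit C", "unit D"]
unit_vectors = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
)
unit_matrix = normalize_rows(unit_vectors)

rows, scores = top_k_similar(np.array([2.0, 0.1, 0.0]), unit_matrix, k=2)
best_texts = [unit_texts[i] for i in rows]  # row index -> originating unit
```

The row index returned by the search is what ties a vector back to its unit, so the mapping survives as long as the matrix rows and the unit records are written in the same order.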
Calculating embeddings is expensive. I want to do it only once. So the workflow is:

  1. split all the documents into units (Text, Meta-Information as Text)
  2. calculate the embeddings for all units (Numpy-Arrays)
  3. store them
  4. be able to search them

Storing them is what gives me headaches.

Further endeavors:
I have already set up the database without the embeddings. Afterwards I investigated how one would store a numpy array inside a PostgreSQL database. Apparently, one has to serialize it to JSON. AFAIK this makes calculating the cosine similarity inside the database pretty much impossible (or at least impossibly slow). I do not believe that it's worth the time to put all my embeddings into a PostgreSQL database right now. There also seem to be some Google courses about working with embeddings, which I will check out.
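As a side note on serialization: JSON is not the only way to get a numpy array into a database column. A vector can also be stored as raw bytes, for example in a PostgreSQL bytea column. The sketch below shows the round trip with the standard-library sqlite3 module and a BLOB column as a stand-in; the table layout and the helper names (to_blob, from_blob) are assumptions, and the bytes are in native byte order, so this presumes the same kind of machine writes and reads them. It does not make similarity search inside the database any faster; it only addresses storage.

```python
import sqlite3

import numpy as np


def to_blob(vector):
    """Serialize a vector as raw float32 bytes (4 bytes per entry)."""
    return np.asarray(vector, dtype=np.float32).tobytes()


def from_blob(blob):
    """Rebuild the vector; the result is a read-only view onto the bytes."""
    return np.frombuffer(blob, dtype=np.float32)


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE units (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)"
)

vec = np.linspace(0, 1, 768, dtype=np.float32)  # placeholder for a real embedding
con.execute(
    "INSERT INTO units (text, embedding) VALUES (?, ?)",
    ("An example unit.", to_blob(vec)),
)

text, blob = con.execute("SELECT text, embedding FROM units").fetchone()
restored = from_blob(blob)
con.close()
```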

1 Answer

[For Python] Storing all of the embeddings in memory at runtime would not be a great idea. Instead, after you calculate the embeddings, save them to a file. Whenever you want to search for the 'most similar phrase', run through the file one line at a time, calculate the cosine similarity score, and keep track of the maximum score and the sentence corresponding to that embedding (you could structure the file as JSON, with one record per line). Doing it this way should allow the program to search through all the embeddings without loading every single embedding into memory.
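A minimal sketch of this line-by-line search, assuming one JSON object per line with hypothetical keys "sentence" and "embedding" (the file name and the toy 3-dimensional vectors are made up for the demo):

```python
import json
import math
import os
import tempfile


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(path, query):
    """Stream the file and keep only the best match seen so far."""
    best_score, best_sentence = float("-inf"), None
    with open(path, encoding="utf-8") as f:
        for line in f:                      # only one record in memory at a time
            record = json.loads(line)
            score = cosine(query, record["embedding"])
            if score > best_score:
                best_score, best_sentence = score, record["sentence"]
    return best_sentence, best_score


# Demo with toy vectors in place of 768-dimensional embeddings.
records = [
    {"sentence": "The meeting is on Monday.", "embedding": [1.0, 0.0, 0.0]},
    {"sentence": "Energy prices rose sharply.", "embedding": [0.0, 1.0, 0.2]},
    {"sentence": "Please review the contract.", "embedding": [0.7, 0.7, 0.0]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    best_sentence, best_score = most_similar(path, [0.1, 0.9, 0.1])
```

Memory use stays constant with this approach, but every query rereads and reparses the whole file, so query time grows with the number of stored units.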

