I have huge text datasets (500,000+ documents) and I want to store embeddings for all sentences or paragraphs in a document. An embedding is a numpy array with 768 entries.
I know that one can easily write numpy arrays to disk, but I also need to store additional information for these embeddings, namely which sentence/paragraph they represent and the document in which that sentence occurs.
I thought about storing all this information in a (PostgreSQL) database; however, I fear that searching over the vectors/embeddings might be slow. The application is similarity search, i.e. finding the vectors most similar to a query.
What is the best way of storing these vectors and their corresponding info? Is it efficient to store Python tuples, in this case (document_ID, sentence_as_string, sentence_embedding)? Does a Postgres database do the job?
I have also thought about storing all embeddings as a numpy matrix in a .npy file and storing just the row number of each embedding in the database. This would mean loading all embeddings into memory, but I feel like this might be best for performance. Is this 'messy'? Are there best practices for storing numpy arrays plus additional information?
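A minimal sketch of that .npy-plus-row-number idea (table and column names are hypothetical, and I use sqlite3 only so the sketch is self-contained; a Postgres table would look the same):

```python
import sqlite3
import numpy as np

# Stand-in for real embeddings: one row per unit.
embeddings = np.random.rand(1000, 768).astype(np.float32)
np.save("embeddings.npy", embeddings)

con = sqlite3.connect("units.db")
con.execute("""CREATE TABLE IF NOT EXISTS units (
                   row_idx     INTEGER PRIMARY KEY,  -- row number in embeddings.npy
                   document_id INTEGER,
                   text        TEXT)""")
con.executemany("INSERT INTO units VALUES (?, ?, ?)",
                [(i, 0, f"sentence {i}") for i in range(len(embeddings))])
con.commit()

# At search time: load the matrix once, search in memory,
# then map the best row indices back to units via row_idx.
matrix = np.load("embeddings.npy")
```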
Edit (Additional Info):
I have several datasets, like the Enron Corpus, which I want to split into sentences or paragraphs. Let's call them units. For each unit, I want to calculate a sentence embedding. These vectors have 768 dimensions. As I want to search for the most similar vectors, I need to calculate the cosine similarity between all vectors. I would also like to calculate the cosine similarity between the embedding of a search query and all vectors, which again makes a comparison against every vector necessary.
Now my question is how to store this information effectively. The application seems to fit a classic relational database schema: a document consists of several units, and each unit has a text field. I suppose one could also store a 768-dimensional vector as an entry in the database, so a unit could have its embedding stored alongside it. However, I fear that calculating the cosine similarity inside the database might be pretty slow compared to having all embeddings in memory. But when I store all embeddings as a numpy array and load them into memory, I lose the information about which unit produced which embedding. So my question is how to best store this large number of 768-dimensional vectors and their corresponding information.
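For what it's worth, once all embeddings sit in one matrix, the query-against-all cosine similarity is a single matrix-vector product. A minimal numpy sketch:

```python
import numpy as np

def most_similar(query_vec, matrix, top_k=5):
    """Cosine similarity of one query vector against all rows of `matrix`."""
    # Normalize rows and the query so the dot product equals cosine similarity.
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = matrix_norm @ query_norm           # shape: (n_units,)
    top = np.argsort(-scores)[:top_k]           # row indices of the best matches
    return top, scores[top]
```

The returned row indices are what the database would map back to units.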
Calculating embeddings is expensive, so I want to do it only once. The workflow is:

1. Split each document into units (sentences or paragraphs).
2. Calculate a 768-dimensional embedding for each unit, once.
3. Store the embeddings together with their unit and document information.
4. At query time, embed the query and compare it against all stored embeddings.

Storing them is what gives me headaches.
Further endeavors:
I have already set up the database without the embeddings. Afterwards, I investigated how one would store a numpy array inside a Postgres DB. Apparently, one has to serialize it to JSON. This makes calculating the cosine similarity inside the database pretty much impossible (or at least impossibly slow), AFAIK. I do not believe it is worth the time to put all my embeddings into a Postgres DB right now. There also seem to be some Google courses about working with embeddings, which I will check out.
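For reference, a minimal sketch of both serialization round-trips: the JSON variant described above, plus a raw-bytes alternative (e.g. for a Postgres bytea column) that is much cheaper to decode. The bytea route is my assumption about an alternative, not something I have benchmarked:

```python
import json
import numpy as np

vec = np.random.rand(768).astype(np.float32)

# JSON round-trip: human-readable, but slow to parse at search time.
as_json = json.dumps(vec.tolist())
restored = np.array(json.loads(as_json), dtype=np.float32)

# Raw-bytes round-trip: suitable for a bytea column, much faster to decode.
as_bytes = vec.tobytes()
restored2 = np.frombuffer(as_bytes, dtype=np.float32)

assert np.allclose(vec, restored) and np.array_equal(vec, restored2)
```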
Storing & Loading Embeddings

The easiest method is to use pickle to store pre-computed embeddings on disk and to load them from disk. This can be especially useful if you need to encode a large set of sentences.
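A minimal sketch of that pattern (the file name and dictionary keys are just illustrative, and the random matrix stands in for a real encoder):

```python
import pickle
import numpy as np

sentences = ["first unit", "second unit"]
embeddings = np.random.rand(len(sentences), 768).astype(np.float32)

# Store sentences and embeddings together so the mapping is never lost.
with open("embeddings.pkl", "wb") as f:
    pickle.dump({"sentences": sentences, "embeddings": embeddings}, f)

# Load them back in one step.
with open("embeddings.pkl", "rb") as f:
    data = pickle.load(f)
```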
In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that words that are closer in the vector space are expected to be similar in meaning.
ELMo is an NLP framework developed by AllenNLP. ELMo word vectors are calculated using a two-layer bidirectional language model (biLM). Each layer comprises a forward and a backward pass. Unlike GloVe and Word2Vec, ELMo represents the embedding for a word using the complete sentence containing that word.
[For Python] Storing all of the embeddings in memory at runtime would not be a great idea. Instead, after you calculate the embeddings, save them to a file. Whenever you want to search for the most similar phrase, run through the file one line at a time, calculate the cosine similarity score, and keep track of the maximum score and the sentence corresponding to that embedding (you could structure the file as JSON lines). Doing it this way lets the program search through all the embeddings without loading every single one into memory.
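A minimal sketch of that streaming search, assuming a JSON-lines file where each line holds a sentence and its embedding (a hypothetical format):

```python
import json
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(path, query_vec):
    """Stream the file line by line, keeping only the best match in memory."""
    best_score, best_sentence = -1.0, None
    with open(path) as f:
        for line in f:
            record = json.loads(line)  # {"sentence": ..., "embedding": [...]}
            score = cosine(query_vec, np.array(record["embedding"], dtype=np.float32))
            if score > best_score:
                best_score, best_sentence = score, record["sentence"]
    return best_sentence, best_score
```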