
Universal sentence encoder for big document similarity

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.

After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google. The problem is that my documents can be very long, and for these very long texts the performance seems to decrease, so my idea was to split the text into sentences/paragraphs.

So I ended up with a list of vectors for each document (one vector for each part of the document).

My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into one vector, as that would create the same effect as before (the relevant part would be diluted in the document). Is there any scoring algorithm to combine the multiple cosine similarities between the query and the different parts of the text?

Important information: I can have both short and long texts, so a document can have anywhere from 1 to 10 vectors.
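
For illustration, a minimal sketch of this setup, assuming the TensorFlow Hub release of the Universal Sentence Encoder (the sample text and the naive sentence split are placeholders):

```python
import tensorflow_hub as hub

# Universal Sentence Encoder: each part is mapped to a 512-dimensional vector.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

text = "First sentence of a long document. Second sentence. A third, more relevant sentence."
parts = [p.strip() for p in text.split(".") if p.strip()]  # naive sentence split

vectors = embed(parts).numpy()  # shape: (number_of_parts, 512) -- the list of vectors for this document
```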

asked Dec 23 '19 by bladeous

1 Answer

One way of doing this is to embed all sentences of all documents, typically storing them in an index such as FAISS or Elasticsearch. Store the document identifier of each sentence: in Elasticsearch this can be metadata, but in FAISS it needs to be held in an external mapping (see the sketch after the steps below). Then:

  1. Embed the query.
  2. Calculate the cosine similarity between the query and all sentence embeddings.
  3. For the top-k results, group by document identifier and sum the similarities (this step is optional depending on whether you're looking for the most similar document or the most similar sentence; here I assume you are looking for the most similar document, so documents containing several highly similar sentences get boosted).

Then you should have an ordered list of relevant document identifiers.
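
For illustration, a minimal end-to-end sketch of this approach, assuming FAISS and the TensorFlow Hub release of the Universal Sentence Encoder; the document contents, the query string and k are placeholders:

```python
import numpy as np
import faiss
import tensorflow_hub as hub
from collections import defaultdict

# Universal Sentence Encoder (512-dimensional embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Assumed input: one list of sentence/paragraph strings per document,
# as described in the question (contents are placeholders).
documents = {
    "doc-1": ["First paragraph of a long document.", "Second paragraph about something else."],
    "doc-2": ["A short document with a single part."],
}

# Build the index: every sentence vector goes into FAISS, and an external
# list maps each FAISS row back to its document identifier.
dim = 512
index = faiss.IndexFlatIP(dim)       # inner product == cosine similarity on normalized vectors
row_to_doc = []
for doc_id, parts in documents.items():
    vecs = embed(parts).numpy().astype("float32")
    faiss.normalize_L2(vecs)         # normalize so inner product equals cosine similarity
    index.add(vecs)
    row_to_doc.extend([doc_id] * len(parts))

# 1. Embed the query.
query_vec = embed(["short user query"]).numpy().astype("float32")
faiss.normalize_L2(query_vec)

# 2. Cosine similarity between the query and all sentence embeddings (top-k).
k = 10
scores, rows = index.search(query_vec, k)

# 3. Group by document identifier and sum the similarities.
doc_scores = defaultdict(float)
for score, row in zip(scores[0], rows[0]):
    if row == -1:                    # FAISS pads with -1 when k exceeds the number of stored vectors
        continue
    doc_scores[row_to_doc[row]] += float(score)

ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)                        # ordered list of (document identifier, aggregated score)
```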

answered Dec 01 '22 by user17124924