I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.
After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google. The problem is that my documents can be very long. For these very long texts it looks like performance decreases, so my idea was to cut the text into sentences/paragraphs.
So I ended up with a list of vectors for each document (one vector per part of the document).
My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into one vector, as that would create the same effect as before (the relevant part would be diluted in the document). Is there a scoring algorithm to combine the multiple cosine similarities between the query and the different parts of the text?
Important information: I can have both short and long texts, so a document can have anywhere from 1 to 10 vectors.
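To make the question concrete, here is a minimal NumPy sketch of what I mean by "scoring from a list of vectors" (the pooling choices are just illustrations, not a method I'm committed to):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_document(query_vec, part_vecs, pooling="max"):
    """Score one document from the similarities of its 1-10 part vectors."""
    sims = [cosine(query_vec, v) for v in part_vecs]
    if pooling == "max":
        return max(sims)          # rewards a single highly relevant part
    return sum(sims) / len(sims)  # mean pooling dilutes the relevant part again
```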
One way of doing this is to embed all sentences of all documents, typically storing them in an index such as FAISS or Elasticsearch. Store the document identifier of each sentence: in Elasticsearch this can be metadata, but in FAISS it needs to be held in an external mapping. Then:

1. Embed the query with the same encoder.
2. Search the index for the top-k most similar sentence vectors.
3. Map each retrieved sentence back to its document identifier.
4. Aggregate the sentence scores per document (for example, keep the maximum similarity).

Then you should have an ordered list of relevant document identifiers.
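Here is a minimal sketch of that pipeline with FAISS. It assumes an `embed()` function standing in for the Universal Sentence Encoder (random vectors here, so the example runs without TensorFlow Hub), and uses max pooling over sentence scores; the document corpus and query are placeholders:

```python
import numpy as np
import faiss

DIM = 512  # output dimension of the Universal Sentence Encoder

def embed(texts):
    # Placeholder for the real encoder (e.g. TF-Hub universal-sentence-encoder).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), DIM)).astype("float32")

# 1. Split documents into parts and embed every part.
documents = {
    "doc1": ["first paragraph ...", "second paragraph ..."],
    "doc2": ["only paragraph ..."],
}
part_texts, part_to_doc = [], []
for doc_id, parts in documents.items():
    for part in parts:
        part_texts.append(part)
        part_to_doc.append(doc_id)   # external mapping: FAISS row -> document id

vectors = embed(part_texts)
faiss.normalize_L2(vectors)          # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(DIM)
index.add(vectors)

# 2. Embed the query and retrieve the nearest parts.
query_vec = embed(["short query"])
faiss.normalize_L2(query_vec)
k = min(10, index.ntotal)
scores, ids = index.search(query_vec, k)

# 3. Aggregate part scores into one score per document (max pooling here).
doc_scores = {}
for score, idx in zip(scores[0], ids[0]):
    if idx < 0:                      # FAISS pads with -1 when fewer than k hits
        continue
    doc_id = part_to_doc[idx]
    doc_scores[doc_id] = max(doc_scores.get(doc_id, -1.0), float(score))

ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)                        # ordered list of (document id, score)
```

Max pooling is only one choice: averaging the top-k sentence scores per document, or weighting by sentence position, are common variations depending on how much of the document needs to match.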