
Handling large search queries on relatively small index documents in Lucene

Tags: java, lucene

I'm working on a project where we index relatively small documents (sentences), and we want to search this index using large documents as queries. Here is a simple example. I index the following document:

docId : 1
text: "back to black"

And I want to query using the following input:

"Released on 25 July 1980, Back in Black was the first AC/DC album recorded without former lead singer Bon Scott, who died on 19 February at the age of 33, and was dedicated to him."

What is the best approach for this in Lucene? For simple examples, where the text I want to find is exactly the input query, I get better results using my own analyzer plus a PhraseQuery than using QueryParser.parse(QueryParser.escape(...my large input...)), which ends up creating a big Boolean/Term query.

But I can't use the PhraseQuery approach for a real-world example. I think I have to use a word n-gram approach such as ShingleAnalyzerWrapper, but as my input documents can be quite large, the combinatorics will become hard to handle...

In other words, I'm stuck, and any ideas would be greatly appreciated :)

P.S. I didn't mention it, but one of the annoying things about indexing small documents is that, because the "norms" value (a float) is encoded on only 1 byte, all 3-4 word sentences get the same norm value. As a result, searching for a sentence like "A B C" makes the results "A B C" and "A B C D" show up with the same score.
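To make the norm collision concrete, here is a minimal sketch of a one-byte float encoding, written from memory of Lucene's SmallFloat.floatToByte315 and paired with the classic 1/sqrt(numTerms) length norm. Both are assumptions about the Lucene version in use (newer versions may encode norms differently), and the class and method names are hypothetical.

```java
// Sketch only: assumed to mirror the one-byte norm encoding of older Lucene
// versions (SmallFloat.floatToByte315). Names here are hypothetical.
final class NormPrecisionSketch {

    // Offset that maps the float exponent range onto the byte's range.
    private static final int ZERO = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);   // drop the low-order mantissa bits
        if (small <= ZERO) {
            // zero and negatives map to 0, tiny positives to the smallest non-zero byte
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO + 0x100) {
            return -1;                  // overflow maps to the largest value
        }
        return (byte) (small - ZERO);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Classic length normalisation: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            float exact = lengthNorm(n);
            float stored = decode(encode(exact));
            System.out.println(n + " terms: exact=" + exact + " stored=" + stored);
        }
    }
}
```

Under these assumptions, a 3-term field (exact norm about 0.577) and a 4-term field (exact norm 0.5) encode to the same byte and both decode to 0.5, which is why "A B C" and "A B C D" end up with identical scores.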

Thanks!

Olivier Girardot asked Nov 03 '22 19:11



1 Answer

I don't know how many sentences you have, but you may want to invert the problem: store your sentences as queries, index each incoming document in a transient in-memory index, and run all your queries against it to find the matching ones.

(Note: this is how Elasticsearch's percolator works.)
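The inverted approach can be sketched in plain Java, without any Lucene dependency. In Lucene itself the transient index would typically be a MemoryIndex and each stored sentence a PhraseQuery; the class below, its method names, and its crude tokenizer are hypothetical stand-ins meant only to show the control flow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of "sentences as queries"; all names are hypothetical.
final class StoredSentenceMatcher {

    // docId -> tokens of the stored sentence, acting as a registered phrase query
    private final Map<String, List<String>> storedPhrases = new LinkedHashMap<>();

    // Rough stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    void register(String docId, String sentence) {
        storedPhrases.put(docId, tokenize(sentence));
    }

    // Tokenize the large input once, then run every stored phrase against it.
    List<String> match(String largeInput) {
        List<String> input = tokenize(largeInput);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : storedPhrases.entrySet()) {
            List<String> phrase = e.getValue();
            // A hit means the phrase occurs as a contiguous run of tokens.
            if (!phrase.isEmpty() && Collections.indexOfSubList(input, phrase) >= 0) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Note that exact phrase matching is strict: with the example from the question, a stored sentence "back in black" matches the AC/DC text, while "back to black" does not. The cost grows with the number of stored sentences, since every one of them is run against each incoming document.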

Edit (2013-06-21):

If you have a very large number of sentences, it might still be better to store the sentences in an index. But instead of using phrase queries, you could try indexing with Lucene's ShingleFilter. At query time, your approach of building the query manually instead of using QueryParser is the right one, but if you index shingles, you can build a pure boolean query in which each clause matches a shingle, instead of a phrase query.
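As a rough illustration of the shingle idea, the plain-Java sketch below builds word shingles of size 2 to 3, keeps a tiny in-memory postings map, and scores each sentence by the number of query shingles it contains, which is roughly what a boolean query of optional shingle clauses would do. It does not use the Lucene API; the class, the shingle sizes, and the scoring are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration of shingle matching; all names are hypothetical.
final class ShingleOverlapSketch {

    // shingle -> ids of the sentences that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Rough stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // All word n-grams with minSize <= n <= maxSize, joined by a single space.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    void index(String docId, String sentence) {
        for (String s : shingles(tokenize(sentence), 2, 3)) {
            postings.computeIfAbsent(s, k -> new TreeSet<>()).add(docId);
        }
    }

    // Each distinct query shingle acts as one optional clause;
    // a sentence's score is the number of clauses it satisfies.
    Map<String, Integer> search(String largeInput) {
        Map<String, Integer> scores = new TreeMap<>();
        Set<String> clauses = new LinkedHashSet<>(shingles(tokenize(largeInput), 2, 3));
        for (String clause : clauses) {
            for (String docId : postings.getOrDefault(clause, Collections.emptySet())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        return scores;
    }
}
```

With a fixed maximum shingle size, the number of clauses grows linearly with the input length (for n tokens and sizes 2 to 3 it is at most (n - 1) + (n - 2)), so the combinatorics stay manageable even for large query documents.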

jpountz answered Nov 14 '22 23:11
