I'm currently indexing webpages using Lucene. The aim is to be able to quickly extract which pages contain a certain expression (usually 1, 2, or 3 words), and which other words (or groups of 1 to 3 of them) are also on the page. This will be used to build / enrich / alter a thesaurus (fixed vocabulary).
From the articles I found, it seems the problem is to find n-grams (or shingles).
Lucene has a ShingleFilter, a ShingleMatrixFilter, and a ShingleAnalyzerWrapper, which seem related to this task.
From this presentation, I learned that Lucene can also search for terms separated by a fixed number of words (called slop). An example is provided here.
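As far as I understand it, a sloppy phrase query would look roughly like this in recent Lucene versions (the field name "body" and the slop value are just for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // Match "foo" and "bar" within an edit distance of 2 positions
    // of each other (slop = 2), instead of strictly adjacent.
    PhraseQuery slopQuery = new PhraseQuery.Builder()
        .add(new Term("body", "foo"))
        .add(new Term("body", "bar"))
        .setSlop(2)
        .build();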
However, I don't clearly understand the difference between these approaches. Are they fundamentally different, or is it a performance / index-size choice that you have to make?
What is the difference between ShingleMatrixFilter and ShingleFilter?
Hope a Lucene guru will find this question and answer ;-) !
The differences between using phrases versus shingles mainly involve performance and scoring.
When using phrase queries (say "foo bar") in the typical case where only single words are in the index, the phrase query has to walk the inverted index for "foo" and for "bar", find the documents that contain both terms, and then walk the positions lists within each of those documents to find the places where "foo" appears right before "bar".
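This is the kind of query that triggers that positions walk. A minimal sketch, assuming a recent Lucene version (the field name "body" is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // Slop defaults to 0, so "bar" must appear immediately after "foo".
    // Lucene intersects the postings lists of both terms, then walks
    // the positions inside each candidate document.
    PhraseQuery exactPhrase = new PhraseQuery("body", "foo", "bar");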
This has costs for both performance and scoring:

- The positions lists of every document containing "foo" and "bar" must be walked, which takes time.
- The phrase IDF has to be approximated from the IDFs of the individual terms, so the score does not reflect how rare the phrase itself actually is.
On the other hand, if you use shingles, you are also indexing word n-grams: if you are shingling up to size 2, you will also have terms like "foo bar" in the index. This means this phrase query will be parsed as a simple TermQuery, without using any positions lists. And since it is now a "real" term, the phrase IDF will be exact, because we know exactly how many documents contain this "term".
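To make that concrete, here is a small sketch using ShingleAnalyzerWrapper (a recent Lucene API is assumed; the field name and sample text are made up):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ShingleDemo {
        public static void main(String[] args) throws IOException {
            // Wrap a base analyzer so it emits single words plus 2-word shingles.
            Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2);
            try (TokenStream ts = analyzer.tokenStream("body", "please find foo bar")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints: please, "please find", find, "find foo", foo, "foo bar", bar
                    System.out.println(term);
                }
                ts.end();
            }

            // Because "foo bar" is now a real term in the index, the phrase
            // can be matched with a plain TermQuery, touching no positions lists.
            Query query = new TermQuery(new Term("body", "foo bar"));
        }
    }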
But using shingles has some costs as well:

- The index gets larger, since every word n-gram becomes an additional term in the dictionary.
- Analysis and indexing get slower, because many more tokens are produced per document.
In general, indexing word n-grams with things like Shingles or CommonGrams is just a (fairly expert) trade-off to reduce the cost of positional queries or to enhance phrase scoring.
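For comparison, a CommonGrams setup might look like the sketch below (the common-word list is illustrative, and a recent Lucene version is assumed). Unlike shingling, it only forms n-grams around frequent words such as stopwords, which keeps index growth much smaller:

    import java.util.Arrays;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            // Only pairs involving these common words get indexed as extra
            // terms, e.g. "of" followed by "the" also yields "of_the".
            CharArraySet common = new CharArraySet(Arrays.asList("the", "of", "a"), true);
            TokenStream sink = new CommonGramsFilter(source, common);
            return new TokenStreamComponents(source, sink);
        }
    };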
But there are real-world use cases for this stuff; a good example is available here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2