
Lucene Proximity Search for phrase with more than two words

Lucene's manual clearly explains proximity search for a two-word phrase, such as the "jakarta apache"~10 example in http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Proximity Searches

However, what exactly does a search like "jakarta apache lucene"~10 do? Does it allow neighboring words to be at most 10 words apart, or does that limit apply to every pair of words?

Thanks!

asked Aug 28 '14 21:08 by dwdwdw



1 Answer

The slop (proximity) works like an edit distance (see PhraseQuery.setSlop), so the terms can be reordered or have extra terms between them. The slop is therefore the maximum number of edits, such as extra terms inserted, allowed across the whole phrase, not a limit on each pair of neighboring words. That is:

"jakarta apache lucene"~3

Will match:

  • "jakarta lucene apache" (distance: 2)
  • "jakarta extra words here apache lucene" (distance: 3)
  • "jakarta some words apache separated lucene" (distance: 3)

But not:

  • "lucene jakarta apache" (distance: 4)
  • "jakarta too many extra words here apache lucene" (distance: 5)
  • "jakarta some words apache further separated lucene" (distance: 4)
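
To check distances like the ones listed above, here is a rough Python sketch. It is not Lucene code: the helper names `slop_needed` and `matches` are made up for illustration, and it assumes a simplified model in which each term's position in the text is shifted back by its offset in the phrase, and the slop needed is the spread between the largest and smallest shifted positions. It also assumes every phrase term occurs exactly once in the text.

```python
def slop_needed(phrase, text):
    """Smallest slop at which `phrase` matches `text` under a simplified model.

    Assumption (not taken from Lucene's source): shift each term's position
    in the text back by its offset in the phrase, then take max - min of the
    shifted positions. Every phrase term must occur exactly once in the text.
    """
    query_terms = phrase.split()
    doc_positions = {term: pos for pos, term in enumerate(text.split())}
    adjusted = [doc_positions[term] - offset
                for offset, term in enumerate(query_terms)]
    return max(adjusted) - min(adjusted)


def matches(phrase, slop, text):
    """True if `text` would match "phrase"~slop under the model above."""
    return slop_needed(phrase, text) <= slop


phrase = "jakarta apache lucene"
print(slop_needed(phrase, "jakarta lucene apache"))                               # 2
print(slop_needed(phrase, "jakarta extra words here apache lucene"))              # 3
print(slop_needed(phrase, "jakarta some words apache separated lucene"))          # 3
print(slop_needed(phrase, "jakarta too many extra words here apache lucene"))     # 5
print(slop_needed(phrase, "jakarta some words apache further separated lucene"))  # 4
```

Under this model the gap examples and the adjacent swap come out as listed. Because the model lets the whole phrase shift rather than holding positions fixed, it scores "lucene jakarta apache" as 3 rather than 4, so that case is worth checking against a real index.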

Some people have been confused by:

"lucene jakarta apache" (distance: 4)

The simple explanation is that swapping terms takes two edits, so:

  1. jakarta apache lucene (distance: 0)
  2. jakarta lucene apache (first swap, distance: 2)
  3. lucene jakarta apache (second swap, distance: 4)

The longer, but more accurate, explanation is that every edit allows a term to be moved by one position. The first move of a swap transposes two terms on top of each other. Keeping this in mind explains why any set of three terms can be rearranged into any order with distance no greater than 4.

  1. jakarta apache lucene (distance: 0)
  2. jakarta [apache,lucene] (distance: 1)
  3. [jakarta,apache,lucene] (all transposed at the same position, distance: 2)
  4. lucene [jakarta,apache] (distance: 3)
  5. lucene jakarta apache (distance: 4)
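
The step lists above count one edit for each single-position move of one term, with the phrase and the rearranged text occupying the same positions. A small Python sketch of that counting rule follows. It only illustrates the explanation and is not how Lucene's scorer is written; `fixed_frame_moves` is a hypothetical helper name.

```python
from itertools import permutations


def fixed_frame_moves(phrase_terms, arrangement):
    """Count single-position moves that turn `phrase_terms` into `arrangement`.

    Both sequences hold the same distinct terms at positions 0..n-1. Terms may
    pass through the same position, so each term moves independently and the
    total is the sum of the absolute displacements.
    """
    target = {term: pos for pos, term in enumerate(arrangement)}
    return sum(abs(target[term] - pos) for pos, term in enumerate(phrase_terms))


phrase = ("jakarta", "apache", "lucene")
distances = {" ".join(p): fixed_frame_moves(phrase, p)
             for p in permutations(phrase)}

for text, moves in sorted(distances.items(), key=lambda item: item[1]):
    print(moves, text)

print(max(distances.values()))  # 4
```

Every ordering of the three terms costs 0, 2, or 4 moves under this rule, which is where the upper bound of 4 in the explanation comes from.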

answered Sep 21 '22 14:09 by femtoRgon