
How do I do a partial match in Elasticsearch?

I have a link like http://drive.google.com and I want to match "google" out of the link.

I have:

    query: {
        bool: {
            must: {
                match: { text: 'google' }
            }
        }
    }

But this only matches if the whole text is 'google' (case-insensitive, so it also matches Google, GooGlE, etc.). How do I match 'google' when it appears inside another string?

ThePumpkinMaster asked Jun 08 '16 17:06


1 Answer

The point is that the Elasticsearch regex you are using requires a full string match:

Lucene’s patterns are always anchored. The pattern provided must match the entire string.

Thus, to allow any characters (other than a newline) before and after the substring, you can wrap it in the .* pattern:

    match: { text: '.*google.*' }
                    ^^      ^^
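The effect of anchoring can be tried outside Elasticsearch. The sketch below uses Python's `re` module purely as a stand-in. This is an assumption for illustration: Python's regex flavor is not Lucene's, but `re.fullmatch` behaves like an always-anchored pattern, while `re.search` does not.

```python
import re

url = "http://drive.google.com"

# Unanchored search: finds "google" anywhere in the string.
unanchored = re.search(r"google", url) is not None

# Anchored match (how Lucene patterns behave): the pattern must cover
# the whole string, so a bare "google" fails...
anchored_bare = re.fullmatch(r"google", url) is not None

# ...while padding with .* on both sides lets the rest of the string match.
anchored_padded = re.fullmatch(r".*google.*", url) is not None

print(unanchored, anchored_bare, anchored_padded)
```

Only the padded pattern succeeds under anchored matching, which is why the leading and trailing .* are needed.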

In ES6+, use regexp instead of match:

    "query": {
        "regexp": { "text": ".*google.*" }
    }

One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in Elasticsearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
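If the substring comes from user input, it may contain characters that the regex engine treats as operators. A minimal sketch of building the regexp query body in Python follows. The helper names `escape_lucene_regex` and `contains_regexp_query` are hypothetical, and the reserved-character set is an assumption based on the Lucene regular expression syntax, so check it against the documentation for your version.

```python
import json

# Characters assumed to be reserved in Lucene regular expressions,
# including the optional operators (# @ & < > ~).
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(literal: str) -> str:
    """Backslash-escape reserved characters so the text is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in literal)


def contains_regexp_query(field: str, substring: str, allow_newlines: bool = False) -> dict:
    """Build a regexp query body matching values that contain `substring`."""
    # (.|\n)* is the newline-tolerant filler described above; .* is the default.
    filler = "(.|\n)*" if allow_newlines else ".*"
    pattern = filler + escape_lucene_regex(substring) + filler
    return {"query": {"regexp": {field: pattern}}}


print(json.dumps(contains_regexp_query("text", "google")))
```

The resulting dictionary can be serialized and sent as the request body of a search call. Nothing here contacts a cluster.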

However, if you do not need complicated patterns or word boundary checking, a plain substring search is better performed with a wildcard query than with a regex:

    {
        "query": {
            "wildcard": {
                "text": {
                    "value": "*google*",
                    "boost": 1.0,
                    "rewrite": "constant_score"
                }
            }
        }
    }

See Wildcard search for more details.

NOTE: The wildcard pattern also needs to match the whole input string, thus

  • google* finds all strings starting with google
  • *google* finds all strings containing google
  • *google finds all strings ending with google

Also, bear in mind the only pair of special characters in wildcard patterns:

  • ?, which matches any single character
  • *, which can match zero or more characters, including an empty sequence
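These wildcard rules can be emulated locally to check which patterns would match a given value. The sketch below translates a wildcard pattern into a Python regex and matches it against the whole string. It is an approximation under stated assumptions, not Elasticsearch's implementation: the function names are hypothetical, escaping of wildcard characters is not handled, and `*` and `?` are assumed to match newlines too.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern into an equivalent Python regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # zero or more characters
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def wildcard_matches(pattern: str, value: str) -> bool:
    """Return True if `pattern` matches the whole of `value`."""
    regex = wildcard_to_regex(pattern)
    # DOTALL reflects the assumption that wildcards also cover newlines.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


value = "google drive"
for pattern in ["google*", "*google*", "*google", "g?ogle*", "google"]:
    print(pattern, wildcard_matches(pattern, value))
```

For the value "google drive", the patterns google*, *google* and g?ogle* match, while *google and the bare google do not, because the pattern has to cover the entire string.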
Wiktor Stribiżew answered Sep 19 '22 08:09
