
Elasticsearch shingles and stopwords

The example at https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html mentions that the standard stopwords filter has a negative side effect when searching with shingles: the filter replaces stopwords with an underscore, producing shingle tokens that contain underscores (and therefore won't match "regular" text queries).

However, it suggests using an enable_position_increments parameter that is no longer supported by Lucene (and produces an error, at least on ES 2.4).

Is there any way to solve this problem, or to achieve the same results, without using the unsupported enable_position_increments? Or are the underscores a minor problem that can be worked around?

I was also wondering whether this could be a non-issue if the same analyzer is used for both search and indexing: if a query includes stopwords, will they also be replaced by _ and thus generate tokens that match the indexed shingles (even if the stopwords themselves were different)?
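For reference, this is roughly how the problem can be observed with the _analyze API (a sketch assuming ES 5.x+ syntax, where inline filter definitions are accepted; on older versions the analyzer has to be defined in the index settings first):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "stop",
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 2,
      "output_unigrams": false
    }
  ],
  "text": "sue ate the alligator"
}

As in the guide's example, the stopword "the" leaves a position gap that the shingle filter fills with _, so shingles such as "ate _" and "_ alligator" show up in the output.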

asked Feb 23 '17 by jmng

2 Answers

I've found that a possible solution is to set the filler_token parameter of the shingle filter to an empty string, so that the underscore is simply omitted from the tokens:

"filter_shingle": {
                "type": "shingle",
                "max_shingle_size": 5,
                "min_shingle_size": 2,
                "output_unigrams": "false",
                "filler_token": ""
            }

Can someone comment on whether this achieves the same results, or whether it creates any unforeseen problems concerning scoring or matching? The output of _analyze looks correct; the _ is omitted.
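For context, here is a sketch of complete index settings using this approach (the index and analyzer names are hypothetical; the stop filter is included so the shingle filter actually has gaps to fill):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 5,
          "min_shingle_size": 2,
          "output_unigrams": "false",
          "filler_token": ""
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "filter_shingle"]
        }
      }
    }
  }
}

Running GET /my_index/_analyze with "analyzer": "my_shingle_analyzer" and some text containing stopwords can then confirm that no underscores appear in the shingle tokens.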

answered Sep 28 '22 by jmng

I use the following approach to deal with this situation:

"filter_shingle": {
                "type": "shingle",
                "max_shingle_size": 2,
                "min_shingle_size": 2,
                "output_unigrams": "true",
                "filler_token": ""
            }.

"analyzer":[   
  "my_shingle":{
    "filter":["lowercase","stop","filter_shingle","trim"],
    "tokenizer": "standard"
  }
]
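For completeness, a sketch of how this analyzer might be attached to a field and used for both indexing and search (the index and field names are hypothetical; the mapping syntax assumes ES 7.x+):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 2,
          "min_shingle_size": 2,
          "output_unigrams": "true",
          "filler_token": ""
        }
      },
      "analyzer": {
        "my_shingle": {
          "filter": ["lowercase", "stop", "filter_shingle", "trim"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_shingle"
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "match": { "title": "the quick brown fox" }
  }
}

Because the search analyzer defaults to the index analyzer, stopwords in the query text go through the same stop + shingle + trim chain as the indexed text.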
answered Sep 28 '22 by hxxxxx