
Elasticsearch phrase suggester returns suggestions that do not exist in my index

I have an Elasticsearch index with some data in it. I implemented a did-you-mean feature so that when a user misspells something, they receive a suggestion with the correct words.

I used the phrase suggester because I need suggestions for short phrases, such as names. The problem is that some suggestions do not exist in the index.

Example:

document in the index: coding like a master
search: Codning like a boss
suggestion: <em>coding</em> like a boss
search result: not found

The problem is that no phrase in my index matches the suggestion, so it recommends phrases that do not exist, and searching for them returns nothing.

What can I do about this? Shouldn't the phrase suggester only suggest phrases that actually exist in the index?

Here are the corresponding query, mapping and settings, in case you need them.

Settings and Mappings

{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "search.slowlog.threshold.fetch.warn": "2s",
      "index.analysis.analyzer.default.filter.0": "standard",
      "index.analysis.analyzer.default.tokenizer": "standard",
      "index.analysis.analyzer.default.filter.1": "lowercase",
      "index.analysis.analyzer.default.filter.2": "asciifolding",
      "index.priority": 3,
      "analysis": {
        "analyzer": {
          "suggests_analyzer": {
            "tokenizer": "lowercase",
            "filter": [
              "lowercase",
              "asciifolding",
              "shingle_filter"
            ],
            "type": "custom"
          }
        },
        "filter": {
          "shingle_filter": {
            "min_shingle_size": 2,
            "max_shingle_size": 3,
            "type": "shingle"
          }
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "suggest_field": {
          "analyzer": "suggests_analyzer",
          "type": "string"
        }
      }
    }
  }
}

Query

{
  "DidYouMean": {
    "text": "Codning like a boss",
    "phrase": {
      "field": "suggest_field",
      "size": 1,
      "gram_size": 1,
      "confidence": 2.0
    }
  }
}

Thanks for your help.

Abraham Duran asked Jan 15 '16 21:01



1 Answer

This is actually expected. If you run your document through the analyze API, you get a better picture of what is happening.

GET suggest_index/_analyze?text=coding like a master&analyzer=suggests_analyzer

This is the output

{
   "tokens": [
      {
         "token": "coding",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      },
      {
         "token": "coding like",
         "start_offset": 0,
         "end_offset": 11,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "coding like a",
         "start_offset": 0,
         "end_offset": 13,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "like",
         "start_offset": 7,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "like a",
         "start_offset": 7,
         "end_offset": 13,
         "type": "shingle",
         "position": 2
      },
      {
         "token": "like a master",
         "start_offset": 7,
         "end_offset": 20,
         "type": "shingle",
         "position": 2
      },
      {
         "token": "a",
         "start_offset": 12,
         "end_offset": 13,
         "type": "word",
         "position": 3
      },
      {
         "token": "a master",
         "start_offset": 12,
         "end_offset": 20,
         "type": "shingle",
         "position": 3
      },
      {
         "token": "master",
         "start_offset": 14,
         "end_offset": 20,
         "type": "word",
         "position": 4
      }
   ]
}
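The token list above can be reproduced without a cluster. This is only a rough model of the analyzer, assuming the lowercase tokenizer splits on non-letters and the shingle filter keeps unigrams (its default); asciifolding is skipped, and the function names are made up for illustration.

```python
import re

def lowercase_tokenize(text):
    # Rough stand-in for the "lowercase" tokenizer:
    # split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def shingle_terms(tokens, min_size=2, max_size=3):
    # Emit each unigram followed by the shingles starting at that position,
    # mirroring a shingle filter that keeps unigrams.
    terms = []
    for i, tok in enumerate(tokens):
        terms.append(tok)
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                terms.append(" ".join(tokens[i:i + size]))
    return terms

terms = shingle_terms(lowercase_tokenize("coding like a master"))
print(terms)
print("coding" in terms)              # the single word is an indexed term
print("coding like a boss" in terms)  # the full suggested phrase is not
```

The single word "coding" is a term in its own right, which is why the suggester can correct "codning" to it even though no document contains the rest of the phrase.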

As you can see, the token "coding" is generated for the text, so it is in your index. The correction offered for "codning" therefore does come from your index, even though the phrase as a whole does not.

If you strictly want whole-phrase suggestions, consider using the keyword tokenizer. For example, if you change your mapping to something like this:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suggests_analyzer": {
            "tokenizer": "lowercase",
            "filter": [
              "lowercase",
              "asciifolding",
              "shingle_filter"
            ],
            "type": "custom"
          },
          "raw_analyzer": {
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        },
        "filter": {
          "shingle_filter": {
            "min_shingle_size": 2,
            "max_shingle_size": 3,
            "type": "shingle"
          }
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "suggest_field": {
          "analyzer": "suggests_analyzer",
          "type": "string",
          "fields": {
            "raw": {
              "analyzer": "raw_analyzer",
              "type": "string"
            }
          }
        }
      }
    }
  }
}

then this query will give you the expected results:

{
  "DidYouMean": {
    "text": "codning lke a master",
    "phrase": {
      "field": "suggest_field.raw",
      "size": 1,
      "gram_size": 1
    }
  }
}

It won't return anything for "codning like a boss".
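One way to see why the two inputs behave differently on the keyword-tokenized field: each document becomes a single term, so the whole input has to be within the suggester's edit limit of the whole indexed phrase. The sketch below uses plain Levenshtein distance and assumes the direct generator's default max_edits of 2; the real generator has further constraints (prefix length, its own distance implementation), so treat this as an approximation.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

MAX_EDITS = 2  # assumed default of the direct generator
indexed_term = "coding like a master"  # one term per document with the keyword tokenizer

for query in ("codning lke a master", "codning like a boss"):
    distance = levenshtein(query, indexed_term)
    print(query, "->", distance, "within limit" if distance <= MAX_EDITS else "too far")
```

The first input is two edits away from the indexed phrase, so it can be corrected; the second differs by far more, so no candidate is found.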

EDIT 1

From your comments, and from running some phrase suggestions on my own dataset, I think a much better approach is to use the collate option that the phrase suggester provides. It checks every suggestion against a query and returns a suggestion only if that query matches at least one document in the index. I have also added stemmers to the mapping so that only root words are considered. I am using light_english because it is less aggressive.

The analyzer part of the mapping now looks like this:

 "analysis": {
     "analyzer": {
         "suggests_analyzer": {
             "tokenizer": "standard",
             "filter": [
                 "lowercase",
                 "english_possessive_stemmer",
                 "light_english_stemmer",
                 "asciifolding",
                 "shingle_filter"
             ],
             "type": "custom"
         }
     },
     "filter": {
         "light_english_stemmer": {
             "type": "stemmer",
             "language": "light_english"
         },
         "english_possessive_stemmer": {
             "type": "stemmer",
             "language": "possessive_english"
         },
         "shingle_filter": {
             "min_shingle_size": 2,
             "max_shingle_size": 4,
             "type": "shingle"
         }
     }
 }

Now this query will give you desired results.

{
   "suggest" : {
     "text" : "appel on the tabel",
     "simple_phrase" : {
       "phrase" : {
         "field" :  "suggest_field",
         "size" :   5,
         "collate": {
           "query": { 
             "inline" : {
               "match_phrase": {
                   "{{field_name}}" : "{{suggestion}}" 
               }
             }
           },
           "params": {"field_name" : "suggest_field"}, 
           "prune": false
         }
       }
     }
   },
   "size": 0
 }

This will give you back "apple on the table". The match_phrase query here runs every suggested phrase against the index. You can set "prune": true to see all the suggestions that were generated, regardless of whether they match. You might also want to consider using a stop filter to deal with stopwords.
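The effect of collate and prune can be modelled in a few lines. This is a simplified local simulation, not how Elasticsearch implements it: phrase matching here is a plain whitespace-token comparison with no stemming, and the function names and sample candidates are hypothetical. The collate_match flag is the field Elasticsearch adds to each option when prune is true.

```python
def tokens(text):
    # Simplified analysis: lowercase and split on whitespace.
    return text.lower().split()

def phrase_matches(doc, phrase):
    # True if the phrase's tokens appear contiguously, in order, in the doc.
    d, p = tokens(doc), tokens(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

def collate(candidates, docs, prune=False):
    # prune=False: keep only candidates that match at least one document.
    # prune=True: keep every candidate and flag whether it matched.
    results = []
    for cand in candidates:
        matched = any(phrase_matches(doc, cand) for doc in docs)
        if prune:
            results.append({"text": cand, "collate_match": matched})
        elif matched:
            results.append({"text": cand})
    return results

docs = ["coding like a master"]
candidates = ["coding like a boss", "coding like a master"]

print(collate(candidates, docs))              # only the phrase that exists
print(collate(candidates, docs, prune=True))  # both, each with a match flag
```

With prune left at false, "coding like a boss" is dropped because no document contains that phrase, which is the behaviour the question asks for.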

Hope this helps!!

ChintanShah25 answered Nov 15 '22 11:11
