
Elasticsearch multi-word keyword-tokenized synonym analysis

I am trying to get keyword-tokenized multi-word synonyms working with the _analyze API. The API returns the expected results for single-word synonyms but not for multi-word ones. Here are my settings and analysis chain:

curl -XPOST "http://localhost:9200/test" -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "type": "synonym",
            "synonyms": [
              "foo bar, fooo bar", 
              "bazzz, baz"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'

Now test using the _analyze API:

curl 'localhost:9200/test/_analyze?analyzer=my_synonyms&text=baz'

The call returns what I expect (the same result is returned for 'bazzz' as well):

{
  "tokens": [
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 3,
      "start_offset": 0,
      "token": "bazzz"
    },
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 3,
      "start_offset": 0,
      "token": "baz"
    }
  ]
}

Now when I try the same call with the multi-word synonym text, the API returns only one token of type 'word' and no synonyms:

curl 'localhost:9200/test/_analyze?analyzer=my_synonyms&text=foo+bar'

(returns)

{
  "tokens": [
    {
      "position": 1,
      "type": "word",
      "end_offset": 7,
      "start_offset": 0,
      "token": "foo bar"
    }
  ]
}

Why isn't the analyze API returning both "foo bar" AND "fooo bar" tokens with type SYNONYM?

asked Aug 08 '14 15:08 by Jeff


1 Answer

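The synonym filter tokenizes its own synonym rules, and by default it uses the whitespace tokenizer for that. The rule "foo bar" is therefore parsed as two tokens, which never match the single "foo bar" token produced by the analyzer's keyword tokenizer. The "tokenizer": "keyword" key-value ALSO needs to be added to the my_syn_filt filter declaration, as follows: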

curl -XPOST "http://localhost:9200/test" -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "tokenizer": "keyword",
            "type": "synonym",
            "synonyms": [
              "foo bar, fooo bar", 
              "bazzz, baz"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'

With the above settings, the _analyze API returns the desired SYNONYM tokens for the text 'foo bar':

{
  "tokens": [
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 7,
      "start_offset": 0,
      "token": "foo bar"
    },
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 7,
      "start_offset": 0,
      "token": "fooo bar"
    }
  ]
}
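
The difference between the two configurations can be illustrated with a toy model in Python. This is only a sketch of the matching idea, not Elasticsearch code: the function names (whitespace_tokenize, keyword_tokenize, build_synonym_map, analyze) are hypothetical, and the model assumes that the synonym filter parses its rules with a whitespace tokenizer unless told otherwise, and that a rule only fires when its token sequence matches the incoming tokens exactly.

```python
# Toy model of synonym-rule parsing. NOT Elasticsearch code; all names are hypothetical.

def whitespace_tokenize(text):
    """Split on whitespace (stands in for the filter's assumed default rule tokenizer)."""
    return tuple(text.split())


def keyword_tokenize(text):
    """Emit the whole input as one token (stands in for the keyword tokenizer)."""
    return (text,)


def build_synonym_map(rules, rule_tokenizer):
    """Map each tokenized term to every term in its comma-separated rule."""
    synonym_map = {}
    for rule in rules:
        terms = [rule_tokenizer(term.strip()) for term in rule.split(",")]
        for term in terms:
            synonym_map[term] = terms
    return synonym_map


def analyze(text, synonym_map):
    """Lowercase and keyword-tokenize the input, then expand it on an exact match."""
    tokens = keyword_tokenize(text.lower())
    matches = synonym_map.get(tokens)
    if matches is None:
        # No rule matched: the token passes through unchanged.
        return [(tokens[0], "word")]
    return [(" ".join(match), "SYNONYM") for match in matches]


RULES = ["foo bar, fooo bar", "bazzz, baz"]

# Rules parsed with whitespace: "foo bar" becomes the two-token key ("foo", "bar").
default_map = build_synonym_map(RULES, whitespace_tokenize)
# Rules parsed with keyword: "foo bar" stays the one-token key ("foo bar",).
fixed_map = build_synonym_map(RULES, keyword_tokenize)

print(analyze("foo bar", default_map))
print(analyze("baz", default_map))
print(analyze("foo bar", fixed_map))
```

In this model the single-word rule matches under either rule tokenizer, because "baz" is one token both ways. The multi-word rule only matches once the rules are parsed with the same keyword tokenizer the analyzer uses, which mirrors the behaviour seen in the two _analyze responses.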

answered Oct 02 '22 22:10 by Jeff