
How to add stopwords to the default list in ElasticSearch

I want to add more words to the default "english" stopwords, e.g., "inc", "incorporated", "ltd" and "limited". How can I achieve this?

My current code to create an index is as follows. Thanks.

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding",
            "my_stop"
          ]
        }
      }
    }
  }
}

My test code

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "House of Dickson<br> corp"
}
Redzon asked Jul 09 '17 15:07



2 Answers

I've been able to combine custom stopwords with the standard English list by chaining two stop filters in the analyzer:

{
    "analysis": {
        "analyzer": {
            "my_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "custom_stop",
                    "english_stop"
                ]
            }
        },
        "filter": {
            "custom_stop": {
                "type":       "stop",
                "stopwords": ["custom1","custom2","custom3"]
            },
            "english_stop": {
                "type":       "stop",
                "stopwords":  "_english_"
            }
        }
    }
}
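
To tie this back to the question, here is a minimal Python sketch that builds the full index-creation body for the asker's analyzer with both stop filters chained. It only produces the JSON and does not talk to Elasticsearch. The helper name build_index_settings is hypothetical, and merging the two filters into the asker's whitespace/html_strip/lowercase/asciifolding chain is my assumption about what they want. Because "lowercase" runs before the stop filters, the custom words are lowercased too.

```python
import json


def build_index_settings(extra_stopwords):
    """Build a body for PUT /my_index that chains a custom stop filter
    with the built-in _english_ list (hypothetical helper, illustration only)."""
    # Lowercase and de-duplicate: the analyzer lowercases tokens before
    # the stop filters see them, so mixed-case entries would never match.
    custom = sorted({word.lower() for word in extra_stopwords})
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "custom_stop": {"type": "stop", "stopwords": custom},
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                },
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "whitespace",
                        "char_filter": ["html_strip"],
                        "filter": [
                            "lowercase",
                            "asciifolding",
                            "custom_stop",
                            "english_stop",
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_index_settings(["inc", "incorporated", "ltd", "limited"])
    # Paste the printed JSON as the body of PUT /my_index.
    print(json.dumps(body, indent=2))
```

Note that the test text in the question uses "corp", which is not among the four words listed there, so it would have to be added to the custom list to be removed as well.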
Matt Saunders answered Nov 15 '22 05:11



The set of "english" stopwords is the same as the set in Standard Analyzer.

You can create a file containing these words plus your additional stopwords, and use the stopwords_path option to point to this file (instead of the stopwords setting):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords_path": "stopwords/custom_english.txt"
        }
      },
      ...
}

You can find more information on what the file should look like in the ES docs (UTF-8, a single stopword per line, file present on all nodes).
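
As a rough sketch of how such a file could be generated, the Python below writes the English list plus extra words, one per line, in UTF-8. The ENGLISH_STOPWORDS list is the default Lucene English stop set as I recall it, so check it against the docs for your version before relying on it. The helper name write_stopwords_file is made up for this example. The file still has to be copied into the Elasticsearch config directory on every node, since stopwords_path is resolved relative to the config location.

```python
from pathlib import Path

# Assumption: this mirrors the list behind "_english_"; verify it against
# the stop token filter documentation for your Elasticsearch version.
ENGLISH_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
]


def write_stopwords_file(path, extra_words):
    """Write the English list plus extra_words to path, one word per line.

    Returns the list of words in the order they were written.
    """
    cleaned = [word.strip().lower() for word in extra_words if word.strip()]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    words = list(dict.fromkeys(ENGLISH_STOPWORDS + cleaned))
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

For the words in the question, a call such as write_stopwords_file("stopwords/custom_english.txt", ["inc", "incorporated", "ltd", "limited"]) would produce a file matching the path used in the filter definition above.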

Joanna answered Nov 15 '22 07:11
