Does ElasticSearch support Unicode / Chinese?

I am doing text search with ElasticSearch and have a problem when querying with the term type. What I am doing below is basically:

  1. Add a document containing a Chinese string (你好).
  2. Query with the text method, which returns the document.
  3. Query with the term method, which returns nothing.

Why does this happen, and how can I resolve it?

➜  curl -XPOST 'http://localhost:9200/test/test/' -d '{ "name" : "你好" }'

{
  "ok": true,
  "_index": "test",
  "_type": "test",
  "_id": "VdV8K26-QyiSCvDrUN00Nw",
  "_version": 1
}

➜  curl -XGET 'http://localhost:9200/test/test/_mapping?pretty=1'

{
  "test" : {
    "properties" : {
      "name" : {
        "type" : "string"
      }
    }
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 1.0,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "text": {
      "name": "你好"
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8838835,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 0.8838835,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "term": {
      "name": "你好"
    }
  }
}'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
kerwin asked Nov 11 '13 10:11


2 Answers

From the ElasticSearch docs about term query:

Matches documents that have fields that contain a term (not analyzed).

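The name field is analyzed by default, so the terms stored in the index are the tokens produced by the analyzer, not the original string. A term query does not analyze its input; it looks for an indexed term exactly equal to the value you pass, which is why "你好" finds nothing. You can see the same effect with a non-Chinese name that the analyzer splits or lowercases, such as "Hello World": a term query for the full string will not find it either. You may now be wondering why the following query does return results: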

curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{"query" : {"term" : { "name" : "好" }}}'

That is because each token produced by the analyzer is stored as its own term, and a term query matches those terms exactly. If you indexed a document with the name "你好吗", a term query for "好吗" or "你好" would still find nothing, but a term query for "你", "好" or "吗" would find it.

For Chinese, you may need to pay special attention to the analyzer you use. For me, the standard analyzer seems good enough, though: it tokenizes Chinese text character by character rather than on whitespace.
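
To make the behaviour concrete, here is a small Python sketch of an inverted index. It is a toy model, not ElasticSearch code: the function names (analyze, build_index, term_query, text_query) are made up for illustration, and analyze simply assumes one token per character, which is how the standard analyzer appears to treat Chinese text.

```python
# Toy model of an inverted index; this is NOT ElasticSearch code.
# All function names here are hypothetical and exist only for illustration.

def analyze(text):
    """Assumed stand-in for the standard analyzer on Chinese text:
    emit one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, term):
    """Look the value up verbatim; the query string is not analyzed."""
    return index.get(term, set())

def text_query(index, text):
    """Analyze the query string first, then match any resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

docs = {1: "你好", 2: "你好吗"}
index = build_index(docs)

print(term_query(index, "你好"))  # set(): no indexed term equals "你好"
print(term_query(index, "好"))    # {1, 2}: "好" is an indexed term
print(text_query(index, "你好"))  # {1, 2}: the query is split into "你" and "好"
```

In this model the only terms in the index are single characters, so a verbatim lookup of a two-character string can never match, while a lookup that analyzes the query first can.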

Torsten Engelbrecht answered Nov 01 '22 23:11


The default analyzer is not well suited to Asian languages. Try using an analyzer like this one: https://github.com/elasticsearch/elasticsearch-analysis-smartcn

Nemesis answered Nov 02 '22 01:11