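I am doing text search with ElasticSearch, and I have a problem querying with the term query type. The session below shows what I am doing: I index a document whose name is "你好", and a text query finds it, but a term query for the same value returns no hits.
So, why does this happen, and how do I resolve it?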
➜  curl -XPOST 'http://localhost:9200/test/test/' -d '{ "name" : "你好" }'
{
  "ok": true,
  "_index": "test",
  "_type": "test",
  "_id": "VdV8K26-QyiSCvDrUN00Nw",
  "_version": 1
}
➜  curl -XGET 'http://localhost:9200/test/test/_mapping?pretty=1'
{
  "test" : {
    "properties" : {
      "name" : {
        "type" : "string"
      }
    }
  }
}
➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1'
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 1.0,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}
➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "text": {
      "name": "你好"
    }
  }
}'
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8838835,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 0.8838835,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}
➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "term": {
      "name": "你好"
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
From the ElasticSearch docs about term query:
Matches documents that have fields that contain a term (not analyzed).
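The name field is analyzed by default, and a term query does not analyze its input, so it only matches when the query string is exactly one of the tokens stored in the index. The standard analyzer split "你好" into the separate tokens "你" and "好", so there is no token "你好" for the term query to find. You can try it by indexing another document with a multi-word name that is not Chinese: a term query for the whole name will not find it either. If you are now wondering why the following search query does return results: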
curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{"query" : {"term" : { "name" : "好" }}}'
That is because each token produced by the analyzer is stored as a term in its own right, and the term query matches against those terms. If you indexed a document with the name "你好吗", a term query for "好吗" or "你好" would still not find it, but a term query for "你", "好" or "吗" would.
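If you need the whole value to match as a single term, one option is to keep the field from being analyzed at all. The following is only a sketch under some assumptions: it targets the ElasticSearch generation used in the question (the `string` type with `"index": "not_analyzed"`; later major versions replaced this with a `keyword` type), `test_exact` is a hypothetical index name, and the mapping is supplied when the index is created because the analysis setting of an existing field cannot simply be switched.

```shell
# Assumption: ElasticSearch is listening on localhost:9200, as in the question.
ES='http://localhost:9200'
# Hypothetical index name, used only for this illustration.
INDEX='test_exact'

# Store "name" as one un-analyzed term instead of per-character tokens.
MAPPING='{
  "mappings": {
    "test": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# The same term query as in the question.
QUERY='{ "query": { "term": { "name": "你好" } } }'

# Create the index with the mapping, index one document, then run the term query.
curl -s -XPUT  "$ES/$INDEX" -d "$MAPPING" || echo "could not reach $ES"
curl -s -XPOST "$ES/$INDEX/test/?refresh=true" -d '{ "name" : "你好" }' || echo "could not reach $ES"
curl -s -XGET  "$ES/$INDEX/test/_search?pretty=1" -d "$QUERY" || echo "could not reach $ES"
```

The trade-off is that a not-analyzed field only matches the exact full value, so a search for "你" alone would no longer find this document.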
For Chinese you might need to pay special attention to the analyzer used. For me the standard analyzer seems good enough, though: it tokenizes Chinese phrases character by character rather than on whitespace.
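To check how a given analyzer actually splits your text, you can ask ElasticSearch to analyze a string without indexing anything. This is a minimal sketch, assuming the same local node and `test` index as in the question and the older form of the Analyze API that takes the analyzer as a URL parameter and the raw text as the request body (newer versions expect a JSON body instead).

```shell
# Assumption: ElasticSearch is listening on localhost:9200, as in the question.
ES='http://localhost:9200'
# Sample text to tokenize.
TEXT='你好吗'

# Ask the standard analyzer which tokens it produces for the sample text.
curl -s -XGET "$ES/test/_analyze?analyzer=standard&pretty=1" -d "$TEXT" || echo "could not reach $ES"
```

Each entry in the returned token list is a value that a term query on an analyzed field can match.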
The default analyzer is not well suited to Asian languages. Try using an analyzer like this one: https://github.com/elasticsearch/elasticsearch-analysis-smartcn