Elasticsearch on multiple fields with partial and full matches

Question

Our Account model has a first_name, a last_name and an ssn (social security number).

I want to do partial matches on first_name and last_name, but an exact match on ssn. I have this so far:

settings analysis: {
    filter: {
      substring: {
        type: "nGram",
        min_gram: 3,
        max_gram: 50
      },
      ssn_string: {
        type: "nGram",
        min_gram: 9,
        max_gram: 9
      },
    },
    analyzer: {
      index_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter: ["lowercase", "substring"]
      },
      search_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter:  ["lowercase", "substring"]
      },
      ssn_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter: ["ssn_string"]
      },
    }
}

mapping do
  [:first_name, :last_name].each do |attribute|
    indexes attribute, type: 'string',
                       index_analyzer: 'index_ngram_analyzer',
                       search_analyzer: 'search_ngram_analyzer'
  end

  indexes :ssn, type: 'string', index: 'not_analyzed'
end

My search is as follows:

query: {
  multi_match: {
    fields: ["first_name", "last_name", "ssn"],
    query: query,
    type: "cross_fields",
    operator: "and"
  }
}

So this works:

 Account.search("erik").records.to_a

and even (for Erik Smith):

 Account.search("erik smi").records.to_a

and the ssn:

 Account.search("111112222").records.to_a

but not:

 Account.search("erik 111112222").records.to_a

Any idea if I am indexing or querying wrong?

Thank you for any help!

Sloan Ahrens · Accepted Answer

Does it have to be done with a single query string? If not, I would do something like this:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "enabled": true,
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "first_name": {
               "type": "string",
               "include_in_all": true
            },
            "last_name": {
               "type": "string",
               "include_in_all": true
            },
            "ssn": {
               "type": "string",
               "index": "not_analyzed",
               "include_in_all": false
            }
         }
      }
   }
}

Notice the use of the _all field. I included first_name and last_name in _all, but not ssn. The ssn field is not analyzed at all, since I want to do exact matches against it.
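To make the effect of the two analyzers on _all more concrete, here is a minimal Ruby sketch that only imitates them; it does not call Elasticsearch. The helper names (ngram_tokens, search_terms, all_terms_match?) are made up for illustration, and the word splitting is a simplification of what the standard tokenizer really does.

```ruby
# Rough stand-in for "ngram_analyzer": lowercase, split into words, then emit
# every substring of length min_gram..max_gram for each word.
def ngram_tokens(text, min_gram: 2, max_gram: 20)
  text.downcase.scan(/[[:alnum:]]+/).flat_map do |word|
    (min_gram..[max_gram, word.length].min).flat_map do |n|
      word.chars.each_cons(n).map(&:join)
    end
  end.uniq
end

# Rough stand-in for the "standard" search analyzer: lowercase words, no ngrams.
def search_terms(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# With operator "and", every search term has to appear among the indexed tokens.
def all_terms_match?(indexed_text, query)
  tokens = ngram_tokens(indexed_text)
  search_terms(query).all? { |term| tokens.include?(term) }
end

# "Erik Smith" stands for the combined first_name and last_name values in _all.
puts all_terms_match?("Erik Smith", "eri smi")
puts all_terms_match?("Erik Smith", "eri jon")
```

The first call succeeds because both "eri" and "smi" are among the indexed ngrams; the second fails because "jon" is not. Analyzing the search string with ngrams as well would be unnecessary here, since the partial terms are already in the index.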

I indexed a couple of documents for illustration:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"first_name":"Erik","last_name":"Smith","ssn":"111112222"}
{"index":{"_id":2}}
{"first_name":"Bob","last_name":"Jones","ssn":"123456789"}

Then I can query for the partial names, and filter by the exact ssn:

POST /test_index/doc/_search
{
   "query": {
      "filtered": {
         "query": {
            "match": {
               "_all": {
                   "query": "eri smi",
                   "operator": "and"
               }
            }
         },
         "filter": {
            "term": {
               "ssn": "111112222"
            }
         }
      }
   }
}

And I get back what I'm expecting:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.8838835,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.8838835,
            "_source": {
               "first_name": "Erik",
               "last_name": "Smith",
               "ssn": "111112222"
            }
         }
      ]
   }
}
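Since the question uses the Ruby Elasticsearch integration, the same request can be written as a Ruby hash. This is a sketch only: name_and_ssn_query is a hypothetical helper name, and it assumes the Account model is mapped like the test index above (ngram-indexed _all, not_analyzed ssn).

```ruby
# Builds the same request body as the "filtered" query above, as a Ruby hash.
# name_query is matched against _all; ssn must match exactly via the term filter.
def name_and_ssn_query(name_query, ssn)
  {
    query: {
      filtered: {
        query: {
          match: {
            _all: { query: name_query, operator: "and" }
          }
        },
        filter: {
          term: { ssn: ssn }
        }
      }
    }
  }
end

body = name_and_ssn_query("eri smi", "111112222")
# The hash could then be passed to the model, for example:
#   Account.search(body).records.to_a
# (not executed here, since it needs a running cluster)
```

Note that the filtered query belongs to older Elasticsearch releases. It was removed in 5.0, where a bool query with a filter clause plays the same role.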

If you need to be able to do the search with a single query string (no filter), you could include ssn in the _all field as well. With this setup, however, it would also match on partial strings (like 111112), so that may not be what you want.
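Another way to accept a single string such as "erik 111112222" while keeping the exact ssn match would be to split the string in application code before building the query. This is not part of the setup above. The helper name split_name_and_ssn and the rule "any token of exactly nine digits is an ssn" are assumptions made for this sketch.

```ruby
# Assumption: an ssn is written as exactly nine digits with no separators.
SSN_PATTERN = /\A\d{9}\z/

# Hypothetical helper: separates a raw search string into the name part
# (for the partial match on _all) and the ssn part (for the exact term filter).
def split_name_and_ssn(query_string)
  ssn_tokens, name_tokens = query_string.split.partition { |t| t.match?(SSN_PATTERN) }
  { names: name_tokens.join(" "), ssn: ssn_tokens.first }
end

p split_name_and_ssn("erik 111112222")
p split_name_and_ssn("erik smi")
```

When :ssn comes back as nil, the term filter would simply be left out and only the match on _all would be sent.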

If you only want to match prefixes (i.e., search terms that start at the beginning of the words), you should use edge ngrams.
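The difference between the two token filters can be sketched in plain Ruby. The functions below only imitate what the filters emit for a single word; the names ngrams and edge_ngrams are illustrative, and the min_gram and max_gram defaults are taken from the ngram_filter above.

```ruby
# Every substring of length min_gram..max_gram, anywhere in the word.
def ngrams(word, min_gram: 2, max_gram: 20)
  (min_gram..[max_gram, word.length].min).flat_map do |n|
    word.chars.each_cons(n).map(&:join)
  end
end

# Only the substrings that start at the beginning of the word (prefixes).
def edge_ngrams(word, min_gram: 2, max_gram: 20)
  (min_gram..[max_gram, word.length].min).map { |n| word[0, n] }
end

p ngrams("smith")
p edge_ngrams("smith")  # prints ["sm", "smi", "smit", "smith"]
```

With plain ngrams a search for "mit" would match "smith"; with edge ngrams it would not, because "mit" is not a prefix of the word.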

I wrote a blog post about using ngrams which might help you out a little: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch

Here is the code I used for this answer. I tried a few different things, including the setup I posted here and another one including ssn in _all, but with edge ngrams. Hope this helps:

http://sense.qbox.io/gist/b6a31c929945ef96779c72c468303ea3bc87320f

Tags:

elasticsearch

axiom_chicago
