Elasticsearch on multiple fields with partial and full matches

Our Account model has a first_name, a last_name and an ssn (social security number).

I want to do partial matches on first_name and last_name, but an exact match on ssn. I have this so far:

settings analysis: {
    filter: {
      substring: {
        type: "nGram",
        min_gram: 3,
        max_gram: 50
      },
      ssn_string: {
        type: "nGram",
        min_gram: 9,
        max_gram: 9
      },
    },
    analyzer: {
      index_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter: ["lowercase", "substring"]
      },
      search_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter:  ["lowercase", "substring"]
      },
      ssn_ngram_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter: ["ssn_string"]
      },
     }
   }

mapping do
  [:first_name, :last_name].each do |attribute|
    indexes attribute, type: 'string',
                       index_analyzer: 'index_ngram_analyzer',
                       search_analyzer: 'search_ngram_analyzer'
  end

  indexes :ssn, type: 'string', index: 'not_analyzed'
end

My search is as follows:

query: {
  multi_match: {
     fields: ["first_name", "last_name", "ssn"],
     query: query,
     type: "cross_fields",
     operator: "and"
  }

}

So this works:

 Account.search("erik").records.to_a

and even (for Erik Smith):

 Account.search("erik smi").records.to_a

and the ssn:

 Account.search("111112222").records.to_a

but not:

 Account.search("erik 111112222").records.to_a

Any idea if I am indexing or querying wrong?

Thank you for any help!

asked Nov 01 '22 06:11 by axiom_chicago


1 Answer

Does it have to be done with a single query string? If not, I would do something like this:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "enabled": true,
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "first_name": {
               "type": "string",
               "include_in_all": true
            },
            "last_name": {
               "type": "string",
               "include_in_all": true
            },
            "ssn": {
               "type": "string",
               "index": "not_analyzed",
               "include_in_all": false
            }
         }
      }
   }
}

Notice the use of the _all field. I included first_name and last_name in _all, but not ssn, and ssn is not analyzed at all since I want to do exact matches against it.
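To make the index/search analyzer split more concrete, here is a rough Ruby sketch of the terms an ngram token filter with min_gram 2 and max_gram 20 would produce for a lowercased token. The `ngrams` helper is hypothetical and only approximates what Elasticsearch does internally. It is meant to show why a search term like "eri" can match a document containing "Erik".

```ruby
# Hypothetical helper: approximates the terms an ngram token filter
# (min_gram: 2, max_gram: 20) emits for a single lowercased token.
def ngrams(token, min_gram: 2, max_gram: 20)
  grams = []
  (0...token.length).each do |start|
    (min_gram..max_gram).each do |len|
      # Stop once the gram would run past the end of the token.
      break if start + len > token.length
      grams << token[start, len]
    end
  end
  grams.uniq
end

# Terms stored in _all for the document "Erik Smith" (after lowercasing).
indexed = ngrams("erik") + ngrams("smith")

# The search side uses the standard analyzer, so "eri smi" stays as the
# whole terms "eri" and "smi"; with operator "and", both must be indexed.
puts %w[eri smi].all? { |term| indexed.include?(term) }
```

Because the search analyzer does not produce ngrams, the query terms are compared as-is against the grams stored at index time.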

I indexed a couple of documents for illustration:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"first_name":"Erik","last_name":"Smith","ssn":"111112222"}
{"index":{"_id":2}}
{"first_name":"Bob","last_name":"Jones","ssn":"123456789"}

Then I can query for the partial names, and filter by the exact ssn:

POST /test_index/doc/_search
{
   "query": {
      "filtered": {
         "query": {
            "match": {
               "_all": {
                   "query": "eri smi",
                   "operator": "and"
               }
            }
         },
         "filter": {
            "term": {
               "ssn": "111112222"
            }
         }
      }
   }
}

And I get back what I'm expecting:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.8838835,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.8838835,
            "_source": {
               "first_name": "Erik",
               "last_name": "Smith",
               "ssn": "111112222"
            }
         }
      ]
   }
}
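From the Rails side, the same search body could be assembled with a small helper. This is only a sketch: `account_query` is a hypothetical name, and it assumes the name terms and the ssn arrive as separate inputs.

```ruby
require "json"

# Hypothetical helper: builds the search body shown above from separate
# name and ssn inputs. The ssn filter is added only when an ssn is given.
def account_query(name, ssn = nil)
  name_match = { match: { _all: { query: name, operator: "and" } } }
  return { query: name_match } if ssn.nil?

  {
    query: {
      filtered: {
        query: name_match,                # partial match on the ngrammed _all field
        filter: { term: { ssn: ssn } }    # exact match on the not_analyzed ssn
      }
    }
  }
end

puts JSON.pretty_generate(account_query("eri smi", "111112222"))
```

With elasticsearch-model, a hash like this can be passed straight to `Account.search`. Note that the `filtered` query was removed in Elasticsearch 5.0; on newer versions the equivalent would be a `bool` query with `must` and `filter` clauses.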

If you need to be able to do the search with a single query string (no filter), you could include ssn in the _all field as well. With this setup, though, it would also match on partial strings (like 111112), so that may not be what you want.
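One alternative, which is not part of the setup above, is to keep a single search box but split the input in application code before querying. The sketch below assumes SSNs are always entered as nine digits with no dashes (as in the sample documents); `split_query` is a hypothetical helper.

```ruby
# Hypothetical helper: separates a free-text query into name terms and an
# optional ssn. Assumes an ssn is exactly nine digits with no dashes.
def split_query(text)
  ssn_terms, name_terms = text.split.partition { |t| t.match?(/\A\d{9}\z/) }
  { name: name_terms.join(" "), ssn: ssn_terms.first }
end

parts = split_query("erik 111112222")
puts parts[:name]
puts parts[:ssn]
```

The name part can then go into the match query and the ssn part into the term filter, so the ssn is still matched exactly.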

If you only want to match prefixes (i.e., search terms that start at the beginning of the words), you should use edge ngrams.
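As a rough illustration of the difference, an edge ngram filter keeps only the grams anchored at the start of each token. The `edge_ngrams` helper below is hypothetical and merely approximates that behavior with the same min_gram and max_gram values used earlier.

```ruby
# Hypothetical helper: approximates an edge ngram filter, which keeps only
# the grams that begin at the first character of the token.
def edge_ngrams(token, min_gram: 2, max_gram: 20)
  longest = [token.length, max_gram].min
  (min_gram..longest).map { |len| token[0, len] }
end

p edge_ngrams("erik")                     # prints ["er", "eri", "erik"]
puts edge_ngrams("erik").include?("rik")  # prints false: "rik" is not a prefix
```

So a search for "eri" would still match "Erik", but a search for "rik" would not.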

I wrote a blog post about using ngrams which might help you out a little: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch

Here is the code I used for this answer. I tried a few different things, including the setup I posted here, and another including ssn in _all, but with edge ngrams. Hope this helps:

http://sense.qbox.io/gist/b6a31c929945ef96779c72c468303ea3bc87320f

answered Nov 15 '22 08:11 by Sloan Ahrens