
Elasticsearch: sorting Spanish double names alphabetically

I am running an Elasticsearch query and I want the results ordered alphabetically by last name. My problem: the last names are all Spanish double names, and ES doesn't order them the way I would like. I would prefer this order:

Batres Rivera
Batrín Chojoj
Fion Morales
Lopez Giron
Martinez Castellanos
Milán Casanova

This is my query:

{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "LastName": {
        "order": "asc"
      }
    }
  ]
}

The order that I get with this is:

Batres Rivera
Batrín Chojoj
Milán Casanova
Martinez Castellanos
Fion Morales
Lopez Giron

So it is not sorting by the first word of each last name, but by whichever of the two words comes first alphabetically (Batres, Batrín, Casanova, Castellanos, Fion, Giron).
If I additionally try

{
    "order": "asc",
    "mode": "max"
}

then I get:

Batrín Chojoj
Lopez Giron
Martinez Castellanos
Milán Casanova
Fion Morales
Batres Rivera

All the fields are indexed with the default settings. I checked with

curl -XGET localhost/my_index/_mapping 

and I get back

my_index: {
    my_type: {
        properties: {
            FirstName: {
                type: string
            },
            LastName: {
                type: string
            },
            MiddleName: {
                type: string
            }
            ...
        }
    }
}

Does anyone know how to get the results ordered alphabetically by the first word of the last name?

Thanks!

asked Dec 08 '22 10:12 by morninlark


1 Answer

The problem is that your LastName field is analyzed, so the string Batres Rivera is indexed as a multi-value field with two terms: batres and rivera. These terms are not kept as an ordered array; they are more like a "bag of values". So when you sort on the field, Elasticsearch picks one of the terms (the min or the max, depending on the sort mode) and sorts on that.
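To make the "bag of values" behaviour concrete, here is a small Python sketch that imitates it. It is a toy model, not Elasticsearch code: analyze() is a hypothetical stand-in that only lowercases and splits on whitespace, which is roughly what the default analyzer does to these names.

```python
# Toy model of sorting on an analyzed string field (not Elasticsearch code).
names = [
    "Batres Rivera", "Batrín Chojoj", "Fion Morales",
    "Lopez Giron", "Martinez Castellanos", "Milán Casanova",
]

def analyze(value):
    """Assumed stand-in for the default analyzer: lowercase, split on whitespace."""
    return value.lower().split()

def sort_analyzed(values, mode="min"):
    """Sort each value by its smallest (mode="min") or largest (mode="max") term."""
    pick = min if mode == "min" else max
    return sorted(values, key=lambda v: pick(analyze(v)))

print(sort_analyzed(names, "min"))
print(sort_analyzed(names, "max"))
```

With the six names from the question, the "min" call reproduces the first unwanted ordering and the "max" call reproduces the second one.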

What you need to do is to store the LastName as a single term (Batres Rivera) for sorting purposes, by mapping the field as

{ "type": "string", "index": "not_analyzed"}

The drawback is that the field is then no longer useful for full-text search: a search for rivera would not match, because the only indexed term is the whole value Batres Rivera.
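The same trade-off can be sketched in a few lines of Python. Again this is only a toy model: index_not_analyzed() and term_match() are hypothetical helpers that treat the whole value as the single indexed term.

```python
# Toy model of a not_analyzed field: the whole value is the only term.
names = [
    "Batres Rivera", "Batrín Chojoj", "Milán Casanova",
    "Martinez Castellanos", "Fion Morales", "Lopez Giron",
]

def index_not_analyzed(value):
    """Assumed behaviour: the value is stored as one unmodified term."""
    return [value]

def term_match(query, value):
    """A term only matches if it equals an indexed term exactly."""
    return query in index_not_analyzed(value)

# Sorting now uses the full string, so the first word decides the order.
sorted_by_raw = sorted(names, key=lambda n: index_not_analyzed(n)[0])
print(sorted_by_raw)

print(term_match("rivera", "Batres Rivera"))         # partial word: no match
print(term_match("Batres Rivera", "Batres Rivera"))  # exact value: match
```

For these six names the sorted list is the order asked for in the question. Note that the comparison is on raw character codes, so an accented letter such as í sorts after z; that is one reason to look at language-specific sorting as well.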

The way to support both searching and sorting is to use multi-fields, i.e. to index the same value in two ways, one for searching and one for sorting.

In 0.90.* the syntax for multi-fields is:

curl -XPUT "http://localhost:9200/my_index" -d'
{
   "mappings": {
      "my_type": {
         "properties": {
            "LastName": {
               "type": "multi_field",
               "fields": {
                  "LastName": {
                     "type": "string"
                  },
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}'

In 1.0.* the multi_field type has been removed and now any core field type supports sub-fields as follows:

curl -XPUT "http://localhost:9200/my_index" -d'
{
   "mappings": {
      "my_type": {
         "properties": {
            "LastName": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}'

So you can use the LastName field for searching, and the LastName.raw field for sorting:

curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
   "query": {
      "match": {
         "LastName": "rivera"
      }
   },
   "sort": "LastName.raw"
}'
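As a rough illustration of what the multi-field buys you, the sketch below keeps both representations side by side in plain Python. It is not Elasticsearch code; index_doc() and search() are hypothetical helpers, and the dictionary keys simply mirror the field names LastName and LastName.raw.

```python
# Toy model of a multi-field: analyzed terms for search, raw value for sorting.
last_names = ["Muñoz Rivera", "Lopez Giron", "Batres Rivera", "Fion Morales"]

def index_doc(last_name):
    """Index one value twice: as lowercase terms and as the untouched string."""
    return {
        "LastName": last_name.lower().split(),  # assumed analyzer behaviour
        "LastName.raw": last_name,              # not_analyzed copy
    }

index = [index_doc(n) for n in last_names]

def search(term):
    """Match on the analyzed terms, then sort the hits on the raw value."""
    hits = [doc for doc in index if term in doc["LastName"]]
    return [doc["LastName.raw"] for doc in sorted(hits, key=lambda d: d["LastName.raw"])]

print(search("rivera"))
```

Searching for rivera finds both double names that contain it, and the hits come back ordered by the full last name.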

Language specific sorting

You should also look at using the ICU analysis plugin to sort using the Spanish sort order (or collation). This is a bit more complex but is worth using:

curl -XPUT "http://localhost:9200/my_index" -d'
{
   "settings": {
      "analysis": {
         "analyzer": {
            "folding": {
               "type": "custom",
               "tokenizer": "icu_tokenizer",
               "filter": [
                  "icu_folding"
               ]
            },
            "es_sorting": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": [
                  "lowercase",
                  "spanish"
               ]
            }
         },
         "filter": {
            "spanish": {
               "type": "icu_collation",
               "language": "es"
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "LastName": {
               "type": "string",
               "analyzer": "folding", 
               "fields": {
                  "raw": {
                     "type": "string",
                     "analyzer": "es_sorting"
                  }
               }
            }
         }
      }
   }
}'

We create a folding analyzer for the LastName field. It analyzes a string like Muñoz Rivera into the two terms munoz (without the tilde) and rivera, so a user can search for either munoz or muñoz and both will match.
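If it helps to see the folding step in isolation, here is an approximation in Python using only the standard library. It is an assumption-laden sketch: the real icu_tokenizer and icu_folding filter handle far more cases than this, and fold() and folding_analyze() are hypothetical names.

```python
import unicodedata

def fold(text):
    """Lowercase and drop combining marks (a rough stand-in for icu_folding)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def folding_analyze(value):
    """Fold the value, then split on whitespace (a rough stand-in for the tokenizer)."""
    return fold(value).split()

print(folding_analyze("Muñoz Rivera"))
# The query string goes through the same folding, so both spellings match.
print(fold("muñoz") in folding_analyze("Muñoz Rivera"))
print(fold("munoz") in folding_analyze("Muñoz Rivera"))
```

Because the indexed terms and the query are folded the same way, the tilde no longer matters for matching.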

Then we create the es_sorting analyzer, which keeps the whole value as a single token, lowercases it, and indexes a collation key that sorts muñoz rivera correctly in Spanish.
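To show why the collation matters, the following Python sketch compares a plain character-code sort with a very simplified Spanish ordering in which ñ is its own letter between n and o. This is only an approximation for illustration: real ICU collation keys are opaque binary values and cover many more rules. spanish_sort_key() is a hypothetical helper, and Munoz and Muzquiz are made-up sample values next to the Muñoz from above.

```python
import unicodedata

SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"

def strip_accents_keep_enye(text):
    """Lowercase and remove accents, but keep ñ as a letter of its own."""
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch == "ñ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def spanish_sort_key(text):
    """Primary key: position in the Spanish alphabet (-1 for spaces and others).
    Tie-breaker: the lowercased original, so accented forms stay deterministic."""
    folded = strip_accents_keep_enye(text)
    primary = [SPANISH_ALPHABET.find(c) for c in folded]
    return (primary, text.lower())

sample = ["Muzquiz", "Muñoz", "Munoz"]
print(sorted(sample))                        # character-code order
print(sorted(sample, key=spanish_sort_key))  # simplified Spanish order
```

In the plain sort, Muñoz ends up after Muzquiz because ñ has a higher character code than z; with the simplified Spanish key it lands between Munoz and Muzquiz.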

Searching would be done in the same way:

curl -XGET "http://localhost:9200/my_index/my_type/_search" -d'
{
   "query": {
      "match": {
         "LastName": "rivera"
      }
   },
   "sort": "LastName.raw"
}'
answered Dec 11 '22 09:12 by DrTech