
Mongo query to ignore non English characters

I have a mongo collection that stores city/country data in multiple languages. For example, the following query:

db.cities_database.find({ "name.pl.country": "Węgry" }).pretty().limit(10);

Returns data in the following format:

[
  {
    _id: ObjectId('67331d2a9566994a18c505aa'),
    geoname_id_city: 714073,
    latitude: 46.91667,
    longitude: 21.26667,
    geohash: 'u2r4guvvmm4m',
    country_code: 'HU',
    population: 7494,
    estimated_radius: 400,
    feature_code: 'PPL',
    name: {
      pl: { city: 'Veszto', admin1: null, country: 'Węgry' },
      ascii: { city: 'veszto', admin1: null, country: null },
      lt: { city: 'Veszto', admin1: null, country: 'Vengrija' },
      ru: { city: 'Veszto', admin1: null, country: 'Венгрия' },
      hu: { city: 'Veszto', admin1: null, country: 'Magyarország' },
      en: { city: 'Veszto', admin1: null, country: 'Hungary' },
      fr: { city: 'Veszto', admin1: null, country: 'Hongrie' }
    }
  }
...
]

I want to be able to run the same query using only English (ASCII) characters. For this example, I'd like to query by "name.pl.country": "Wegry", so that Mongo treats the character ę as e when performing the query.

Is it possible to achieve this?

So far I tried using collation like this:

db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "pl", strength: 1 }).pretty().limit(10);

but this query doesn't return anything.

Sebastian Meckovski asked Oct 23 '25 16:10



2 Answers

I don't know Polish or the difference between e and ę. But if you use MongoDB Atlas, you can set up a custom analyzer with the icuFolding token filter to perform diacritic-insensitive search.

The index:

{
  "analyzer": "diacriticFolder",
  "mappings": {
    "fields": {
      "name": {
        "type": "document",
        "fields": {
          "pl": {
            "type": "document",
            "fields": {
              "country": {
                "analyzer": "diacriticFolder",
                "type": "string"
              }
            }
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ]
    }
  ]
}

$search query:

[
  {
    $search: {
      "text": {
        "query": "Wegry",
        "path": "name.pl.country"
      }
    }
  }
]

MongoDB Atlas search playground
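
As a rough illustration of what folding does to text before matching, the plain JavaScript sketch below strips combining marks after Unicode NFD decomposition. It is only an approximation and not how icuFolding is implemented: the helper name foldDiacritics is hypothetical, and letters without a canonical decomposition (such as ł) pass through unchanged here.

```javascript
// Hypothetical helper (not part of MongoDB or Atlas Search): approximates
// diacritic folding using built-in Unicode normalization.
function foldDiacritics(text) {
  return text
    .normalize("NFD")         // split "ę" into "e" + U+0328 (combining ogonek)
    .replace(/\p{M}/gu, "");  // drop all combining marks
}

console.log(foldDiacritics("Węgry"));        // Wegry
console.log(foldDiacritics("Magyarország")); // Magyarorszag
```

A value folded this way could, for example, be stored in a separate field next to the original and queried with a plain equality match, assuming the query term is folded the same way.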

ray answered Oct 26 '25 10:10



I think that is simply how the Polish collation is defined; see the Polish CLDR chart.

ę and Ę are shown in black, which I guess means "must match exactly". Other characters (e.g. é É è È ê Ê ë Ë) are shown in grey, and for them it works:

db.collection.insertMany([
   { codepoint: 'U+00EB', name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
   { codepoint: 'U+0119', name: 'Latin Small Letter E with Ogonek', char: 'ę' },
   { codepoint: 'U+0065', name: 'Latin Small Letter E', char: 'e' }
])

Querying them gives:

db.collection.find({ char: "ë" }).collation({ locale: "pl", strength: 1 })
[
  { name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
  { name: 'Latin Small Letter E', char: 'e' }
]

db.collection.find({ char: "ę" }).collation({ locale: "pl", strength: 1 })
[
  { name: 'Latin Small Letter E with Ogonek', char: 'ę' }
]

db.collection.find({ char: "e" }).collation({ locale: "pl", strength: 1 })
[
  { name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
  { name: 'Latin Small Letter E', char: 'e' }
]

Maybe you are looking for a non-Polish locale, for example:

db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "en_US_POSIX", strength: 1 })
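
The locale dependence described above can also be observed outside MongoDB. Here is a minimal sketch using JavaScript's built-in Intl.Collator, where sensitivity: "base" is roughly comparable to collation strength 1. It assumes a runtime with full ICU locale data (otherwise the "pl" collator may silently fall back to another locale), and the function name baseEqual is made up for illustration.

```javascript
// Hypothetical helper: true when two strings differ at most by accents or case
// under the given locale's collation rules.
function baseEqual(locale, a, b) {
  const collator = new Intl.Collator(locale, { sensitivity: "base" });
  return collator.compare(a, b) === 0;
}

// Print the comparison results for English and Polish rules side by side.
for (const locale of ["en", "pl"]) {
  console.log(locale, {
    "e vs ë": baseEqual(locale, "e", "ë"),
    "e vs ę": baseEqual(locale, "e", "ę"),
    "Wegry vs Węgry": baseEqual(locale, "Wegry", "Węgry"),
  });
}
```

With English rules all three pairs should compare equal. With Polish rules the pairs involving ę are expected to differ, which would be consistent with the query results shown earlier in this answer.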
Wernfried Domscheit answered Oct 26 '25 10:10



