I have a mongo collection that stores city/country data in multiple languages. For example, the following query:
db.cities_database.find({ "name.pl.country": "Węgry" }).pretty().limit(10);
Returns data in the following format:
[
{
_id: ObjectId('67331d2a9566994a18c505aa'),
geoname_id_city: 714073,
latitude: 46.91667,
longitude: 21.26667,
geohash: 'u2r4guvvmm4m',
country_code: 'HU',
population: 7494,
estimated_radius: 400,
feature_code: 'PPL',
name: {
pl: { city: 'Veszto', admin1: null, country: 'Węgry' },
ascii: { city: 'veszto', admin1: null, country: null },
lt: { city: 'Veszto', admin1: null, country: 'Vengrija' },
ru: { city: 'Veszto', admin1: null, country: 'Венгрия' },
hu: { city: 'Veszto', admin1: null, country: 'Magyarország' },
en: { city: 'Veszto', admin1: null, country: 'Hungary' },
fr: { city: 'Veszto', admin1: null, country: 'Hongrie' }
}
}
...
]
I want to be able to run the same query using only unaccented (ASCII) characters. For this example, I'd like to query by "name.pl.country": "Wegry", so that Mongo treats the character ę as e when performing the query.
Is it possible to achieve this?
So far I tried using collation like this:
db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "pl", strength: 1 }).pretty().limit(10);
but this query doesn't return anything.
I don't know Polish, so I can't speak to the difference between e and ę. But if you use MongoDB Atlas, you can set up a custom analyzer with the icuFolding token filter to perform a diacritic-insensitive search.
The index:
{
"analyzer": "diacriticFolder",
"mappings": {
"fields": {
"name": {
"type": "document",
"fields": {
"pl": {
"type": "document",
"fields": {
"country": {
"analyzer": "diacriticFolder",
"type": "string"
}
}
}
}
}
}
},
"analyzers": [
{
"name": "diacriticFolder",
"charFilters": [],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": [
{
"type": "icuFolding"
}
]
}
]
}
$search query:
[
{
$search: {
"text": {
"query": "Wegry",
"path": "name.pl.country"
}
}
}
]
MongoDB Atlas search playground
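To get a feel for what diacritic folding does to the indexed value, here is a rough plain-JavaScript approximation. It is only a sketch, not how icuFolding is implemented: the helper name foldDiacritics is made up for illustration, and it relies on Unicode NFD decomposition plus lowercasing, which covers ę, ó or á but not letters without a canonical decomposition, such as ł.

```javascript
// Hypothetical helper, for illustration only (not part of MongoDB or Atlas).
// Approximates diacritic folding by decomposing characters (NFD),
// dropping the combining marks, and lowercasing the result.
function foldDiacritics(text) {
  return text
    .normalize("NFD")          // "ę" becomes "e" + U+0328 (combining ogonek)
    .replace(/\p{M}/gu, "")    // remove all combining marks
    .toLowerCase();
}

console.log(foldDiacritics("Węgry"));        // wegry
console.log(foldDiacritics("Magyarország")); // magyarorszag
```

With both the stored value and the search term folded this way, "Wegry" and "Węgry" end up as the same token, which is the effect the index above is meant to achieve.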
I think that's simply how the Polish collation is defined; see the Polish CLDR chart.
ę and Ę are shown in black, which I guess means "must match exactly". Other characters (e.g. é É è È ê Ê ë Ë) are grey, and for them it works:
db.collection.insertMany([
{ codepoint: 'U+00EB', name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
{ codepoint: 'U+0119', name: 'Latin Small Letter E with Ogonek', char: 'ę' },
{ codepoint: 'U+0065', name: 'Latin Small Letter E', char: 'e' }
])
When you query them, you get:
db.collection.find({ char: "ë" }).collation({ locale: "pl", strength: 1 })
[
{ name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
{ name: 'Latin Small Letter E', char: 'e' }
]
db.collection.find({ char: "ę" }).collation({ locale: "pl", strength: 1 })
[
{ name: 'Latin Small Letter E with Ogonek', char: 'ę' }
]
db.collection.find({ char: "e" }).collation({ locale: "pl", strength: 1 })
[
{ name: 'Latin Small Letter E with Diaeresis', char: 'ë' },
{ name: 'Latin Small Letter E', char: 'e' }
]
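The same comparison can be tried outside MongoDB with JavaScript's built-in Intl.Collator, where sensitivity: "base" corresponds to a primary-strength comparison. This is only a sketch: the helper name matchesAtPrimaryStrength is made up for illustration, and the results depend on the ICU locale data bundled with your runtime, which may differ from the data your MongoDB server uses.

```javascript
// Hypothetical helper, for illustration only.
// Assumption: the runtime ships full ICU locale data (recent Node.js builds do).
// sensitivity: "base" ignores accent and case differences, similar to strength: 1.
function matchesAtPrimaryStrength(locale, a, b) {
  const collator = new Intl.Collator(locale, { sensitivity: "base" });
  return collator.compare(a, b) === 0;
}

// Print whether "e" is considered equal to each accented variant per locale.
for (const locale of ["pl", "en"]) {
  for (const ch of ["ë", "ę"]) {
    console.log(locale, ch, matchesAtPrimaryStrength(locale, "e", ch));
  }
}
```

If the "pl" collator reports ę as different from e while the "en" collator reports them as equal, that matches the behaviour seen in the queries above.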
Maybe you are looking for:
db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "en_US_POSIX", strength: 1 })