MongoDB diacriticInSensitive search not showing all accented (words with diacritic mark) rows as expected and vice-versa

Question

I have a document collection with the following structure:

uid, name

with a text index:

db.Collection.createIndex({name: "text"})

It contains the following data:

1, iphone
2, iphóne
3, iphonë
4, iphónë

When I run a text search for iphone, I get only two records, which is unexpected:

Actual output
--------------
1, iphone
2, iphóne

If I search for iphonë:

db.Collection.find( { $text: { $search: "iphonë"} } );

I get:
---------------------
3, iphonë
4, iphónë

However, I expect both of the following queries to return the output below:

db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { $text: { $search: "iphónë"} } );

    Expected output
    ------------------
    1, iphone
    2, iphóne
    3, iphonë
    4, iphónë

Am I missing something here? How can I get the expected output above when searching for either iphone or iphónë?

felix · Accepted Answer

Since MongoDB 3.2 (text index version 3), text indexes are diacritic insensitive:

With version 3, text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterpart, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in Unicode 8.0 Character Database Prop List.

So the following queries should both work:

db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { $text: { $search: "iphónë"} } );

However, there appears to be a bug with the dieresis ( ¨ ), even though it is categorized as a diacritic in the Unicode 8.0 list (JIRA issue: SERVER-29918).
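To make the intended folding concrete, here is a minimal sketch in plain JavaScript (runnable in Node.js, no MongoDB needed). It only approximates what the text index is documented to do: it uses Unicode NFD decomposition and removes combining marks, which is my assumption for illustration and not MongoDB's actual implementation. The names foldDiacritics, sampleNames and foldedNames are made up for this example.

```javascript
// Sketch only: approximates diacritic folding with NFD decomposition.
// MongoDB's text index uses its own rules; this is not its implementation.
function foldDiacritics(text) {
  // NFD splits "ó" into "o" + U+0301 and "ë" into "e" + U+0308;
  // \p{M} then removes the combining marks.
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// The four names from the question.
const sampleNames = ["iphone", "iphóne", "iphonë", "iphónë"];
const foldedNames = sampleNames.map(foldDiacritics);

console.log(foldedNames); // every entry folds to "iphone"
```

With folding like this, all four documents would be indexed under the same term, which is why both searches are expected to return all four rows.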

Solution

Since MongoDB 3.4, you can use a collation, which allows you to perform this kind of query.

For example, to get your expected output, run the following query:

db.Collection.find({name: "iphone"}).collation({locale: "en", strength: 1})

This will output:

{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphóne" }
{ "_id" : 3, "name" : "iphonë" }
{ "_id" : 4, "name" : "iphónë" }

In the collation, strength is the level of comparison to perform:

1 : base characters only (diacritics and case are ignored)
2 : diacritic sensitive
3 : case sensitive + diacritic sensitive
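
The effect of these levels can be tried outside MongoDB with JavaScript's built-in Intl.Collator. The sketch below assumes a Node.js build with ICU support and treats the sensitivity values "base", "accent" and "variant" as rough counterparts of strength 1, 2 and 3; that mapping is my approximation, not something MongoDB documents. The names storedNames and matchesAt are hypothetical.

```javascript
// Sketch only: Intl.Collator stands in for MongoDB's collation here.
// Assumed mapping: "base" ~ strength 1, "accent" ~ strength 2, "variant" ~ strength 3.
const storedNames = ["iphone", "iphóne", "iphonë", "iphónë"];

// Return the stored names that compare equal to `term` at the given sensitivity.
function matchesAt(term, sensitivity) {
  const collator = new Intl.Collator("en", { sensitivity });
  return storedNames.filter((name) => collator.compare(name, term) === 0);
}

const baseMatches = matchesAt("iphone", "base");        // diacritics and case ignored
const accentMatches = matchesAt("iphone", "accent");    // diacritics matter, case ignored
const accentUpper = matchesAt("IPHONE", "accent");      // case still ignored
const variantUpper = matchesAt("IPHONE", "variant");    // diacritics and case matter

console.log(baseMatches.length, accentMatches.length, accentUpper.length, variantUpper.length);
```

Only the base-level comparison treats all four names as equal to "iphone", which is the behaviour the strength: 1 query relies on.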

Tags: search, mongodb, diacritics, text-search, accent-insensitive

vikram eklare
