Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use of collation in mongodb $regex

Since v3.4 collations are available for find operations, especially as matches for diacritic characters are concerned. While a find query with a definite value ($eq opeartor or corresponding construct) will match letters and correspondent diacritics, the same is not true if a $regex is used in order to achieve a match on a partial search string (a 'LIKE').

Is there a to make the $regex query use the collation the same way than the $eq query?

consider example collection testcoll:

{ "_id" : ObjectId("586b7a0163aff45945462bea"), "city" : "Antwerpen" }, 
{ "_id" : ObjectId("586b7a0663aff45945462beb"), "city" : "Antwërpen" }

this query will find both records

db.testcoll.find({city: 'antwerpen'}).collation({"locale" : "en_US", "strength" : 1});

the same query using a regex will not (finds the record with 'Antwerpen' only)

db.testcoll.find({city: /antwe/i}).collation({"locale" : "en_US", "strength" : 1});
like image 652
Freud Chicken Avatar asked Jan 03 '17 10:01

Freud Chicken


Video Answer


2 Answers

Documentation

Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.

like image 96
joseph Avatar answered Nov 09 '22 07:11

joseph


I face this same issue today and I searched the Internet like crazy trying to find a solution. Didn't find any. So I came up with my on solution, a little frankenstein that worked for me.

I created a function which removes all the special characters from a string and then replaces all the characters that could be special to the equivalent regexp that could be special. In the end I just add a "i" option to cover the capitalize strings in my DB.

export const convertStringToRegexp = (text: string) => {
  let regexp = '';
  const textNormalized = text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // remove all accents
    .replace(/[|\\{}()[\]^$+*?.]/g, '\\$&') // remove all regexp reserved char
    .toLowerCase();

  regexp = textNormalized
    .replace(/a/g, '[a,á,à,ä,â,ã]')
    .replace(/e/g, '[e,é,ë,è,ê]')
    .replace(/i/g, '[i,í,ï,ì,î]')
    .replace(/o/g, '[o,ó,ö,ò,õ,ô]')
    .replace(/u/g, '[u,ü,ú,ù,û]')
    .replace(/c/g, '[c,ç]')
    .replace(/n/g, '[n,ñ]')
    .replace(/[ªº°]/g, '[ªº°]');
  return new RegExp(regexp, 'i'); // "i" -> ignore case
};

And in my find() method, I just use this function with $regex option, like this:

db.testcoll.find({city: {$regex: convertStringToRegexp('twerp')} })

/*
Output:
[
  { "_id" : ObjectId("586b7a0163aff45945462bea"), "city" : "Antwerpen" }, 
  { "_id" : ObjectId("586b7a0663aff45945462beb"), "city" : "Antwërpen" }
]
*/

I also create a .spec.ts file (using Chai) to test this function. Of course you could adapt to Jest.


describe('ConvertStringToRegexp', () => {
  it('should convert all "a" to regexp', () => {
    expect(convertStringToRegexp('TAÁdaáh!')).to.deep.equal(
      /t[a,á,à,ä,â,ã][a,á,à,ä,â,ã]d[a,á,à,ä,â,ã][a,á,à,ä,â,ã]h!/i
    );
  });
  it('should convert all "e" to regexp', () => {
    expect(convertStringToRegexp('MEÉeéh!')).to.deep.equal(
      /m[e,é,ë,è,ê][e,é,ë,è,ê][e,é,ë,è,ê][e,é,ë,è,ê]h!/i
    );
  });
  it('should convert all "i" to regexp', () => {
    expect(convertStringToRegexp('VÍIiishí!')).to.deep.equal(
      /v[i,í,ï,ì,î][i,í,ï,ì,î][i,í,ï,ì,î][i,í,ï,ì,î]sh[i,í,ï,ì,î]!/i
    );
  });
  it('should convert all "o" to regexp', () => {
    expect(convertStringToRegexp('ÓOoóhhhh!!!!')).to.deep.equal(
      /[o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô][o,ó,ö,ò,õ,ô]hhhh!!!!/i
    );
  });
  it('should convert all "u" to regexp', () => {
    expect(convertStringToRegexp('ÚUhuuúll!')).to.deep.equal(
      /[u,ü,ú,ù,û][u,ü,ú,ù,û]h[u,ü,ú,ù,û][u,ü,ú,ù,û][u,ü,ú,ù,û]ll!/i
    );
  });
  it('should convert all "c" to regexp', () => {
    expect(convertStringToRegexp('Cacacacaca')).to.deep.equal(
      /[c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã][c,ç][a,á,à,ä,â,ã]/i
    );
  });
  it('should remove all special characters', () => {
    expect(
      convertStringToRegexp('hello 123 °º¶§∞¢£™·ª•*!@#$%^WORLD?.')
    ).to.deep.equal(
      /h[e,é,ë,è,ê]ll[o,ó,ö,ò,õ,ô] 123 [ªº°][ªº°]¶§∞¢£™·[ªº°]•\*!@#\$%\^w[o,ó,ö,ò,õ,ô]rld\?\./i
    );
  });
  it('should accept all regexp reserved characters', () => {
    expect(
      convertStringToRegexp('Olá [-[]{}()*+?.,\\o/^$|#s] Mundo! ')
    ).to.deep.equal(
      /* eslint-disable @typescript-eslint/no-explicit-any */
      /[o,ó,ö,ò,õ,ô]l[a,á,à,ä,â,ã] \[-\[\]\{\}\(\)\*\+\?\.,\\[o,ó,ö,ò,õ,ô]\/\^\$\|#s\] m[u,ü,ú,ù,û][n,ñ]d[o,ó,ö,ò,õ,ô]! /i
    );
  });
});

like image 25
Raphael Soares Avatar answered Nov 09 '22 09:11

Raphael Soares