
Removing accents from a QString [duplicate]

Tags: unicode, qt

I want to remove accents, and more generally diacritic marks, from a string in order to implement an accent-insensitive search. Based on some reading on Unicode character classes, I've come up with this:

 #include <QString>

 QString unaccent(const QString &s)
 {
   // Decompose so that each accent becomes a separate combining character
   const QString s2 = s.normalized(QString::NormalizationForm_D);
   QString out;
   for (int i = 0, j = s2.length(); i < j; i++)
   {
     // strip diacritic marks
     const QChar::Category cat = s2.at(i).category();
     if (cat != QChar::Mark_NonSpacing &&
         cat != QChar::Mark_SpacingCombining)
     {
       out.append(s2.at(i));
     }
   }
   return out;
 }

It appears to work reasonably well for Latin-based languages, but I'm wondering how adequate it is for other scripts (Arabic, Cyrillic, CJK...), which I cannot test because I don't know them well enough.

Specifically, I would like to know:

  1. Which Unicode normalization form is better suited to this problem: NormalizationForm_KD or NormalizationForm_D?
  2. Is it sufficient to remove the characters belonging to the Mark_NonSpacing and Mark_SpacingCombining categories, or should more categories be included?
  3. Are there other improvements to the above code that would make it work as well as possible for all languages?
Daniel Vérité asked Sep 05 '12 09:09


1 Answer

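#include <QRegExp>
#include <QString>

QString unaccent(const QString &s)
{
    QString output(s.normalized(QString::NormalizationForm_D));
    // Keeps only ASCII letters and whitespace; digits, punctuation and
    // non-Latin letters are removed along with the combining marks.
    return output.replace(QRegExp("[^a-zA-Z\\s]"), "");
}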
Heitor answered Oct 12 '22 23:10
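
To make the difference between the two approaches on this page concrete, here is a minimal sketch in plain standard C++ (no Qt), so it can be compiled anywhere. It rests on several assumptions: the input is already canonically decomposed (what `normalized(QString::NormalizationForm_D)` would produce), the helper names `isCombiningDiacritic`, `keepAsciiLetters` and `stripMarks` are hypothetical, and the hard-coded range U+0300..U+036F is only a stand-in for a real Unicode category lookup such as `QChar::category()`. The whitelist function only approximates the regular expression from the answer, since `\s` may match more whitespace characters than the ASCII ones listed here.

```cpp
#include <string>

// Hypothetical helper: true for code points in the Combining Diacritical
// Marks block (U+0300..U+036F). This range is an illustrative stand-in;
// real code would query the Unicode category of each character instead.
bool isCombiningDiacritic(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Whitelist approach, roughly mirroring [^a-zA-Z\s]:
// keeps only ASCII letters and ASCII whitespace, drops everything else.
std::u32string keepAsciiLetters(const std::u32string &decomposed)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        const bool letter = (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
        const bool space = c == U' ' || c == U'\t' || c == U'\n' ||
                           c == U'\r' || c == U'\f' || c == U'\v';
        if (letter || space)
            out.push_back(c);
    }
    return out;
}

// Mark-stripping approach: drops only the combining marks and keeps
// every other code point (digits, punctuation, non-Latin letters).
std::u32string stripMarks(const std::u32string &decomposed)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        if (!isCombiningDiacritic(c))
            out.push_back(c);
    }
    return out;
}
```

With the decomposed form of "Vérité 2012" as input, the whitelist keeps the letters and the space but loses the digits, while mark stripping keeps the digits too. For a Cyrillic word, the whitelist returns an empty string and mark stripping returns the word unchanged, which is why the whitelist is only a reasonable choice when the searchable text is known to be Latin-only.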