I want to remove accents, and more generally diacritic marks, from a string in order to run an accent-insensitive search. Based on some reading on Unicode character classes, I've come up with this:
QString unaccent(const QString &s)
{
    const QString s2 = s.normalized(QString::NormalizationForm_D);
    QString out;
    for (int i = 0, j = s2.length(); i < j; i++)
    {
        // strip diacritic marks
        const QChar::Category cat = s2.at(i).category();
        if (cat != QChar::Mark_NonSpacing &&
            cat != QChar::Mark_SpacingCombining)
        {
            out.append(s2.at(i));
        }
    }
    return out;
}
It appears to work reasonably well for Latin-based languages, but I'm wondering about its adequacy for other scripts: Arabic, Cyrillic, CJK... which I cannot test myself for lack of familiarity with them.
Specifically, I'd like to know:
- Is it better to use NormalizationForm_KD or NormalizationForm_D?
- Is it enough to remove the characters in the Mark_NonSpacing and Mark_SpacingCombining categories, or should it include more categories?

QString unaccent(const QString &s)
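{
    QString output(s.normalized(QString::NormalizationForm_D));
    // Note: this keeps only ASCII letters and whitespace, so digits,
    // punctuation and every non-Latin letter are removed as well.
    return output.replace(QRegExp("[^a-zA-Z\\s]"), "");
}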
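On the categories question, it may help to see what the Mark_NonSpacing / Mark_SpacingCombining filter does outside Latin. Below is a minimal sketch in plain C++ (no Qt, so it is not the code from the question). It assumes the input is already in NFD and stored as UTF-32. The names markKind and stripMarks are made up for this illustration, and markKind is only a tiny hand-written stand-in for QChar::category() that knows the few code points used here, not the real Unicode database.

```cpp
#include <string>

// Hypothetical stand-in for QChar::category(): it covers only the
// code points used in this illustration, not the Unicode database.
enum class MarkKind { None, NonSpacing, SpacingCombining };

MarkKind markKind(char32_t c)
{
    // Combining Diacritical Marks block (acute, breve, diaeresis, ...)
    if (c >= 0x0300 && c <= 0x036F) return MarkKind::NonSpacing;
    // Arabic short-vowel marks, fathatan through sukun
    if (c >= 0x064B && c <= 0x0652) return MarkKind::NonSpacing;
    // Combining kana voiced / semi-voiced sound marks
    if (c == 0x3099 || c == 0x309A) return MarkKind::NonSpacing;
    // Devanagari vowel signs AA, I, II
    if (c >= 0x093E && c <= 0x0940) return MarkKind::SpacingCombining;
    return MarkKind::None;
}

// Assumes `decomposed` is already in NFD. Non-spacing marks are always
// dropped; spacing combining marks are dropped only if stripSpacing is set.
std::u32string stripMarks(const std::u32string &decomposed, bool stripSpacing)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        const MarkKind kind = markKind(c);
        if (kind == MarkKind::NonSpacing)
            continue;
        if (kind == MarkKind::SpacingCombining && stripSpacing)
            continue;
        out.push_back(c);
    }
    return out;
}
```

With this, U"e\u0301" (é in NFD) becomes "e", and Arabic U+0643 U+064E U+062A U+064E U+0628 U+064E loses its short-vowel marks, which is usually what a search wants. But the same filter turns Cyrillic й (и + U+0306 in NFD) into и and Japanese が (か + U+3099 in NFD) into か, which are different letters. With stripSpacing set, Devanagari काम (U+0915 U+093E U+092E) becomes कम, which as far as I know is a different word. So whether a given set of categories is adequate probably depends on the language, not just on the category.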