Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters that have accents, umlauts, etc. with their unaccented equivalents, so that every accented é becomes a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:
UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}
Assume that s has already been normalized. Thanks.
ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC, which means: decompose, remove diacritics ([:M:] matches every character in the Unicode Mark category), recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:
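#include <memory>
#include <stdexcept>
#include <string>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // createInstance returns an object owned by the caller, so hold it in a
    // unique_ptr to avoid leaking it
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> accentsConverter(
        icu::Transliterator::createInstance(
            "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));

    // One way to handle errors: report the ICU error name via an exception
    if (U_FAILURE(status) || !accentsConverter) {
        throw std::runtime_error(u_errorName(status));
    }

    // Transliterate the UTF-16 UnicodeString in place
    accentsConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);
    return result;
}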
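To see why the rule decomposes before removing anything, here is a minimal standard-library sketch of just the middle step. It is not how ICU implements it. The helper names is_basic_combining_mark and remove_basic_marks are hypothetical, the input is assumed to be UTF-32 text already in NFD, and only the Combining Diacritical Marks block (U+0300 to U+036F) is covered, which is a small subset of what [:M:] matches.

```cpp
#include <string>

// Hypothetical helper: true only for the Combining Diacritical Marks block
// (U+0300..U+036F). ICU's [:M:] set covers many more characters.
bool is_basic_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: drops basic combining marks from text that is
// assumed to be decomposed (NFD) already. Base characters are kept as-is.
std::u32string remove_basic_marks(const std::u32string& decomposed) {
    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        if (!is_basic_combining_mark(c)) {
            out.push_back(c);
        }
    }
    return out;
}
```

With decomposed input such as U"e\u0301" (e followed by a combining acute accent), the mark is dropped and a plain e remains. A precomposed U"\u00e9" passes through unchanged, because it is a single code point that is not a mark. That is the reason the ICU rule runs NFD first.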