
How to change diacritic characters to non-diacritic ones [duplicate]

I've found an answer on Stack Overflow about how to remove diacritic characters, but could you please tell me whether it is possible to change diacritic characters to non-diacritic ones?

I'm thinking of .NET (or another platform, if it is not possible there).

Tom Smykowski asked Dec 01 '08 16:12




1 Answer

Since no one has ever bothered to post the code to do this, here it is:

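    using System.Text;
    using System.Text.RegularExpressions;

    // The class name is illustrative; these members can live in any helper class.
    public static class DiacriticsHelper
    {
        // \p{Mn} (Unicode category "Mark, Nonspacing"):
        //   a character intended to be combined with another
        //   character without taking up extra space
        //   (e.g. accents, umlauts, etc.).
        private static readonly Regex nonSpacingMarkRegex =
            new Regex(@"\p{Mn}", RegexOptions.Compiled);

        public static string RemoveDiacritics(string text)
        {
            // Treat null as empty input.
            if (text == null)
                return string.Empty;

            // FormD splits each accented character into its base letter
            // followed by separate combining marks.
            var normalizedText =
                text.Normalize(NormalizationForm.FormD);

            // Drop the combining marks and keep the base letters.
            return nonSpacingMarkRegex.Replace(normalizedText, string.Empty);
        }
    }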
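For anyone who prefers to see the two steps spelled out, here is a minimal sketch of the same idea without a regex. The names `DiacriticsDemo` and `StripMarks` are made up for this example. It decomposes the string, skips every character whose Unicode category is `NonSpacingMark`, and recomposes what is left.

```csharp
using System.Globalization;
using System.Text;

// Illustrative names only; not part of the answer's original code.
public static class DiacriticsDemo
{
    public static string StripMarks(string text)
    {
        if (text == null)
            return string.Empty;

        // Step 1: decompose, e.g. "é" becomes "e" followed by U+0301.
        var decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        // Step 2: copy everything except the combining marks.
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                builder.Append(c);
        }

        // Recompose anything that is still in decomposed form.
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

`DiacriticsDemo.StripMarks("crème brûlée")` returns `"creme brulee"`. Letters that have no canonical decomposition are left alone, so `StripMarks("Łódź")` returns `"Łodz"`: the `Ł` is a single letter with a stroke, not an `L` plus a combining mark.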

Note: a big reason for needing to do this is integrating with a third-party system that only handles ASCII while your data is in Unicode. This is common. Your options are basically either to drop the accented characters entirely, or to strip only the accents so that you preserve as much of the original input as you can. Obviously, this is not a perfect solution, but it is much better than simply removing every character above ASCII 127.
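If the target really is ASCII-only, stripping combining marks is not quite enough, because some letters (for example `ł`, `ø`, `ß`, `æ`) have no canonical decomposition and survive normalization unchanged. A rough sketch of one way to handle that is below. The class `AsciiFolding`, the method `ToAscii`, the small `Fallbacks` table and the `?` placeholder are all assumptions for illustration, and the table is far from complete.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Illustrative sketch; names and the mapping table are hypothetical.
public static class AsciiFolding
{
    // Hand-picked replacements for letters that FormD does not decompose.
    // Extend this table for the languages you actually need.
    private static readonly Dictionary<char, string> Fallbacks =
        new Dictionary<char, string>
        {
            { 'Ł', "L" },  { 'ł', "l" },
            { 'Ø', "O" },  { 'ø', "o" },
            { 'Đ', "D" },  { 'đ', "d" },
            { 'Æ', "AE" }, { 'æ', "ae" },
            { 'ß', "ss" },
        };

    public static string ToAscii(string text, char placeholder = '?')
    {
        if (text == null)
            return string.Empty;

        var builder = new StringBuilder(text.Length);

        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;                        // drop the accent itself

            if (c <= 127)
                builder.Append(c);               // already ASCII
            else if (Fallbacks.TryGetValue(c, out var replacement))
                builder.Append(replacement);     // hand-mapped letter
            else
                builder.Append(placeholder);     // no sensible ASCII equivalent
        }

        return builder.ToString();
    }
}
```

With this sketch, `AsciiFolding.ToAscii("Łódź")` gives `"Lodz"`, `AsciiFolding.ToAscii("Straße")` gives `"Strasse"`, and text with no Latin equivalent such as `"日本"` comes back as `"??"`. Whether a placeholder or silent removal is the right choice depends on what the receiving system expects.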

dan answered Oct 11 '22 03:10