
How do I remove diacritics (accents) from a string in .NET?

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter. (E.g. convert é to e, so crème brûlée would become creme brulee)

What is the best method for achieving this?

James Hall asked Oct 30 '08 02:10


2 Answers

I've not used this method, but Michael Kaplan describes one in his blog post (with a confusing title) about stripping diacritics: "Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)".

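    using System.Globalization;
    using System.Text;

    static string RemoveDiacritics(string text)
    {
        // FormD decomposes each accented letter into a base letter
        // followed by its combining mark(s).
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

        for (int i = 0; i < normalizedString.Length; i++)
        {
            char c = normalizedString[i];
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);

            // Keep everything except non-spacing marks (Unicode category Mn).
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        // Recompose whatever is left into FormC.
        return stringBuilder
            .ToString()
            .Normalize(NormalizationForm.FormC);
    }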

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize with NormalizationForm.FormD to decompose the input string into its constituent characters (basically separating the "base" characters from the combining diacritical marks), then scans the result and drops the non-spacing marks while keeping everything else. It's just a little complicated, but really you're looking at a complicated problem.
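To see what the decomposition step actually produces, here is a minimal sketch that prints each character of the FormD-normalized string with its Unicode category. The class and method names (`DecompositionDemo`, `Describe`) are made up for illustration and are not from Kaplan's post.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DecompositionDemo
{
    // Lists every UTF-16 code unit of the FormD-normalized text as
    // "U+XXXX:Category", separated by single spaces.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            sb.Append($"U+{(int)c:X4}:{CharUnicodeInfo.GetUnicodeCategory(c)} ");
        }
        return sb.ToString().TrimEnd();
    }
}
```

For "é" this gives `U+0065:LowercaseLetter U+0301:NonSpacingMark`, so dropping the mark leaves a plain "e". For "ø" it gives `U+00F8:LowercaseLetter`: the stroke is part of the letter rather than a combining mark, so letters like this pass through RemoveDiacritics unchanged.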

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
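For comparison, a table-based version might look roughly like the sketch below. This is my own C# adaptation of the idea, not the code from the linked C++ answer; the class name, the method name, and the contents of the table are assumptions, and the table is not claimed to be a complete list of French accented letters. It also assumes the input is in precomposed form (FormC), since a decomposed "e" followed by a combining accent would not match any key.

```csharp
using System.Collections.Generic;
using System.Text;

static class FrenchAccentTable
{
    // Hypothetical hand-written table; extend it to cover whatever
    // characters your data actually contains.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['à'] = "a", ['â'] = "a", ['ä'] = "a", ['ç'] = "c",
        ['é'] = "e", ['è'] = "e", ['ê'] = "e", ['ë'] = "e",
        ['î'] = "i", ['ï'] = "i", ['ô'] = "o", ['ö'] = "o",
        ['ù'] = "u", ['û'] = "u", ['ü'] = "u", ['ÿ'] = "y",
        ['œ'] = "oe", ['æ'] = "ae",
        ['À'] = "A", ['Â'] = "A", ['Ç'] = "C",
        ['É'] = "E", ['È'] = "E", ['Ê'] = "E", ['Ë'] = "E",
        ['Î'] = "I", ['Ï'] = "I", ['Ô'] = "O",
        ['Ù'] = "U", ['Û'] = "U", ['Ü'] = "U",
        ['Œ'] = "OE", ['Æ'] = "AE",
    };

    // Replaces every character found in the table and copies the rest as-is.
    public static string Replace(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (Map.TryGetValue(c, out string plain))
            {
                sb.Append(plain);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this table, `FrenchAccentTable.Replace("crème brûlée")` returns "creme brulee". One thing a table can do that the normalization approach cannot is expand ligatures such as "œ" to "oe", because those have no Unicode decomposition. The trade-off is that any character missing from the table is silently left alone.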

Blair Conrad answered Oct 06 '22 11:10


this did the trick for me...

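    // Sample input; a local must be assigned before use or the snippet won't compile.
    string accentedStr = "crème brûlée";

    // ISO-8859-8 (Latin/Hebrew) has no accented Latin letters, so this relies on
    // the encoder's best-fit fallback mapping them to their plain ASCII letters.
    byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
    string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);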

quick&short!

azrafe7 answered Oct 06 '22 12:10