
Converting combining diacritics to simple utf

I have a problem inserting a string into a database because of an encoding issue.

The string comes from an external RSS feed. In a web browser it looks fine. Even in the debugger the text appears to be fine. If I copy the string into Notepad, the result is also fine.


But in Notepad++ I could see that the string uses combining characters. After switching the encoding to ANSI, the base letter and the combining mark show up separately, e.g.

á is displayed as a´

(In Notepad++ it is like having two chars, one over the other. I can even select ... half of the char.)


I googled a lot and tried several different approaches to this problem. I really want to find a clean way to convert a string with combining diacritics into plain, database-compatible UTF-8 characters.

Any help? Thank you so much!

Valid Fixed asked Jan 02 '14 18:01



1 Answer

This should work for you (Normalize returns a new string, so assign the result):

output = output.Normalize(NormalizationForm.FormC);

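This little test prints 3, 2, 3. In the middle case, A and its diacritic are correctly composed into a single character (2 bytes in UTF-8). The last one stays at 3 bytes because Unicode has no precomposed T with a circumflex.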

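// Requires: using System; using System.Text;
Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302")); // 3: 'A' plus combining circumflex
Console.WriteLine(Encoding.UTF8.GetByteCount("A\u0302".Normalize(NormalizationForm.FormC))); // 2: precomposed U+00C2
Console.WriteLine(Encoding.UTF8.GetByteCount("T\u0302".Normalize(NormalizationForm.FormC))); // 3: no precomposed form exists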
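
For the case from the question (a followed by a combining acute accent), here is a minimal sketch of how the normalization step might be wrapped before the insert. The class name DiacriticsDemo, the helper ToComposed and the variable fromFeed are made up for illustration; only IsNormalized and Normalize are actual .NET string methods.

```csharp
using System;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper (not part of .NET): composes combining sequences
    // into precomposed code points wherever Unicode defines one.
    public static string ToComposed(string input) =>
        input.IsNormalized(NormalizationForm.FormC)
            ? input
            : input.Normalize(NormalizationForm.FormC);

    public static void Main()
    {
        // 'a' followed by U+0301 COMBINING ACUTE ACCENT, as it might arrive from the feed
        string fromFeed = "a\u0301";
        string composed = ToComposed(fromFeed);

        Console.WriteLine(fromFeed.Length);       // 2 (two UTF-16 code units)
        Console.WriteLine(composed.Length);       // 1 (single precomposed code unit)
        Console.WriteLine(composed == "\u00E1");  // True (U+00E1 is the precomposed á)
    }
}
```

Sequences without a precomposed form (like the T example above) come back unchanged, so whether the database column can store them still depends on its character set.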
noggin182 answered Sep 30 '22 19:09
