Converting special characters such as Ã¼ and Ãƒ back to their original Latin alphabet counterparts in C#

Tags:

c#

character-encoding

special-characters

mojibake

latin

I have been given an export from a MySQL database whose encoding seems to have been muddled over time. It contains a mix of HTML character entities such as &uuml; and more problematic character sequences representing the same letters, such as Ã¼ and Ãƒ. My task is to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.

An example of the sort of string I am dealing with is

DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen

Which should equate to

50 Tattoo Desinfektionsl ö sungst ü cher f ü r Fl ä chen
50 Tattoo Desinfektionsl ÃƒÂ¶ sungst ÃƒÂ¼ cher f ÃƒÂ¼ r Fl ÃƒÂ¤ chen

Is there a method available in C#/.NET 4.5 that would successfully re-encode the likes of Ã¼ and Ãƒ to UTF-8?

If not, what approach would be advisable?

Also, is the paragraph character ¶ in the above example string an actual paragraph character, or part of some other character combination?

In case I need to fall back on find and replace, I have created the lookup table below, but I am unsure how complete it is.

Ã‰ -> É
â€œ -> "
â€ -> "
Ã‡ -> Ç
Ãƒ -> Ã
Ã©, 'é
Ã  -> À
Ãº -> ú
â€¢ -> -
Ã˜ -> Ø
Ãµ -> õ
Ã -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã  -> à


asked Feb 20 '13 12:02

Gga

2 Answers

Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like it's UTF-8 data that was incorrectly decoded using an 8-bit encoding.

There is no built-in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.

What you can try is to reverse the mistake: encode the string back to bytes using the wrong 8-bit encoding, then decode those bytes as UTF-8:

byte[] data = Encoding.Default.GetBytes(input);
string output = Encoding.UTF8.GetString(data);

On .NET Framework (such as the .NET 4.5 in the question), Encoding.Default is the current ANSI code page of your system. You can try some different encodings there and see which one gives the best result.
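
As a rough illustration of "trying different encodings", the sketch below runs the same two-step conversion with a few candidate code pages. It assumes .NET Framework, where these code pages are available without extra setup. The class name EncodingProbe, the helper Reinterpret and the choice of candidate code pages are made up for this example, and the sample text is written with \u escapes so that the source file's own encoding cannot interfere.

```csharp
using System;
using System.Text;

public static class EncodingProbe
{
    // Encodes the text with the given single-byte code page, then decodes
    // the resulting bytes as UTF-8 (the "wrong way around" trick from above).
    public static string Reinterpret(string input, int codePage)
    {
        byte[] data = Encoding.GetEncoding(codePage).GetBytes(input);
        return Encoding.UTF8.GetString(data);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "fÃ¼r": the word "für" after one round of mis-decoding.
        string input = "f\u00C3\u00BCr";

        // Candidates chosen for illustration only:
        // 1252 = Windows-1252, 28591 = ISO-8859-1, 1250 = Windows-1250.
        foreach (int codePage in new[] { 1252, 28591, 1250 })
        {
            Console.WriteLine(codePage + ": " + Reinterpret(input, codePage));
        }
    }
}
```

For this particular sample, code pages 1252 and 28591 both give back "für", because the two characters involved sit at the same byte values in both. Which candidate works best for a real export has to be judged by eye on a larger sample.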

answered Sep 18 '22 23:09

Guffa

The data is only partly unrecoverable, because Windows-1252 has 5 unassigned byte values. Some variants of Windows-1252 fill these with control characters, but those don't survive being posted on Stack Overflow. If such a variant was used, you can fully recover the text as long as you don't lose the hidden control characters when copying and pasting.

There is also the non-breaking space character, which is usually dropped or turned into a regular space when copying and pasting, but that's not an issue when you deal with the bytes directly.

The misencoding abuse this string has gone through is:

UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252
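
To see where sequences like ÃƒÂ¶ come from, the chain can be replayed in the forward direction on a single letter. This is a minimal sketch, assuming .NET Framework, where code page 1252 is available without extra setup. The names MojibakeDemo and Garble are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // One round of damage: the UTF-8 bytes of the text are wrongly
    // decoded as if they were Windows-1252.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(1252).GetString(utf8Bytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        string once = Garble("\u00F6");  // "ö" (bytes C3 B6) becomes "Ã¶"
        string twice = Garble(once);     // "Ã¶" (bytes C3 83 C2 B6) becomes "ÃƒÂ¶"

        Console.WriteLine(once);
        Console.WriteLine(twice);
    }
}
```

This also suggests that the ¶ in the question's sample is not a real paragraph sign. It is the second UTF-8 byte of ö (0xB6) read as a Windows-1252 character.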

To recover, here is an example:

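// Requires: using System; using System.Text;
string a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
Encoding utf8 = Encoding.GetEncoding(65001);    // UTF-8
Encoding win1252 = Encoding.GetEncoding(1252);  // Windows-1252

// Undo both rounds of damage: each round encodes the text as Windows-1252
// to get the original bytes back, then decodes those bytes as UTF-8.
string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));

Console.WriteLine(result);
//Desinfektionslösungstücher für Flächen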
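
Since the question describes a file where the amount of damage varies from value to value, the nested call could be generalised into a loop that stops as soon as another round would no longer produce valid UTF-8. The following is only a sketch under several assumptions: it targets .NET Framework, it treats each string as all-or-nothing (a value mixing intact and damaged text is left alone), and the names MojibakeRepair, Repair and maxRounds are hypothetical. It also decodes HTML entities such as &uuml; at the end with WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text;

public static class MojibakeRepair
{
    // Strict variants: throw instead of silently substituting '?' or U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);
    private static readonly Encoding StrictWin1252 = Encoding.GetEncoding(
        1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    public static string Repair(string text, int maxRounds = 3)
    {
        string current = text;
        for (int i = 0; i < maxRounds; i++)
        {
            string next;
            try
            {
                // One repair round: back to bytes via Windows-1252, then read as UTF-8.
                next = StrictUtf8.GetString(StrictWin1252.GetBytes(current));
            }
            catch (ArgumentException)
            {
                // EncoderFallbackException or DecoderFallbackException:
                // the text is not (or no longer) mis-decoded UTF-8, so stop here.
                break;
            }
            if (next == current) break;  // pure ASCII, nothing left to repair
            current = next;
        }
        // Finally turn entities such as &uuml; into real characters.
        return WebUtility.HtmlDecode(current);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Repair("f\u00C3\u0192\u00C2\u00BCr")); // "fÃƒÂ¼r", damaged twice
        Console.WriteLine(Repair("f&uuml;r"));                   // HTML entity only
    }
}
```

Both calls in Main should print "für". Text that happens to be valid both ways, for example a genuine "Ã¼" that was meant literally, would be converted too, so the result is worth spot-checking before overwriting the original data.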

answered Sep 20 '22 23:09

Esailija
