 

How do I translate 8-bit characters into 7-bit characters? (e.g. Ü to U)


I'm looking for pseudocode, or sample code, to convert higher-bit ASCII characters (like Ü, which is extended ASCII 154) into U (which is ASCII 85).

My initial guess is that, since there are only about 25 extended characters that resemble 7-bit ASCII characters, a translation array would have to be used.
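A minimal sketch of that lookup-table idea in C# (the table below is illustrative and covers only a handful of Latin-1 characters; a complete version would enumerate the rest):

using System.Collections.Generic;
using System.Text;

static class Transliterator
{
    // Illustrative translation table — extend with the remaining accented characters.
    static readonly Dictionary<char, char> Table = new Dictionary<char, char>
    {
        ['Ü'] = 'U', ['Ö'] = 'O', ['Ä'] = 'A',
        ['ü'] = 'u', ['ö'] = 'o', ['ä'] = 'a',
        ['é'] = 'e', ['è'] = 'e', ['ñ'] = 'n',
    };

    public static string To7Bit(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
            sb.Append(Table.TryGetValue(c, out char plain) ? plain : c);
        return sb.ToString();
    }
}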

Let me know if you can think of anything else.

Asked Sep 26 '08 by Michael Pryor.




2 Answers

For .NET users, the CodeProject article (thanks to GvS's tip) does indeed answer the question more correctly than any other I've seen so far.

However, the code in that article (solution #1) is cumbersome. Here's a compact version:

// Based on http://www.codeproject.com/Articles/13503/Stripping-Accents-from-Latin-Characters-A-Foray-in
private static string LatinToAscii(string inString)
{
    var newStringBuilder = new StringBuilder();
    // Decompose accented characters, then keep only the ASCII (< 128) code units.
    newStringBuilder.Append(inString.Normalize(NormalizationForm.FormKD)
                                    .Where(x => x < 128)
                                    .ToArray());
    return newStringBuilder.ToString();
}
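As a quick sanity check (assuming using System.Linq; and using System.Text; are in scope), the behavior looks like this:

Console.WriteLine(LatinToAscii("Über"));   // "Uber"  — Ü decomposes to U + a combining mark
Console.WriteLine(LatinToAscii("Straße")); // "Strae" — ß has no compatibility
                                           // decomposition, so it is dropped entirely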

To expand a bit on this answer: the method uses String.Normalize, which:

Returns a new string whose textual value is the same as this string, but whose binary representation is in the specified Unicode normalization form.

Specifically, in this case we use NormalizationForm.FormKD, described in those same MSDN docs as follows:

FormKD - Indicates that a Unicode string is normalized using full compatibility decomposition.

For more information about Unicode normalization forms, see Unicode Standard Annex #15.
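To see concretely what FormKD does to a single character, here is a small illustrative check:

string decomposed = "Ü".Normalize(NormalizationForm.FormKD);
// decomposed is "U\u0308": 'U' followed by COMBINING DIAERESIS (U+0308).
// The combining mark is >= 128, so the Where filter above removes it,
// leaving only the plain ASCII 'U'.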

Answered Oct 05 '22 by sinelaw.


Most languages have a standard way to replace accented characters with plain ASCII, but it depends on the language, and it often involves replacing a single accented character with two ASCII ones: in German, for example, ü becomes ue. So if you want to handle natural languages properly, it's a lot more complicated than you might think.
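A minimal sketch of that idea for German (this table is illustrative, not a complete transliteration scheme, and other languages need different rules):

using System.Collections.Generic;
using System.Text;

static class GermanTransliterator
{
    // Hypothetical German mapping — one accented character becomes two ASCII ones.
    static readonly Dictionary<char, string> GermanMap = new Dictionary<char, string>
    {
        ['ä'] = "ae", ['ö'] = "oe", ['ü'] = "ue",
        ['Ä'] = "Ae", ['Ö'] = "Oe", ['Ü'] = "Ue",
        ['ß'] = "ss",
    };

    public static string ToAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
            sb.Append(GermanMap.TryGetValue(c, out string two) ? two : c.ToString());
        return sb.ToString();
    }
}

// GermanTransliterator.ToAscii("Größe") returns "Groesse".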

Answered Oct 05 '22 by Mark Baker.