Unicode: English characters above code point 127

Question

I'm giving a tech talk about Unicode and encoding at my company, in which I'm trying to make the point that strings are always encoded, and that developers should never carelessly assume that everything is 0-127 ASCII.

I have numerous examples of problems caused by mis-encoded text, but I haven't found any example of simple English text with numbers that is encoded above Unicode code point 127.

The basic English alphabet is mapped in Unicode to the same numerical values as in plain old ASCII: the range [A-Z] is mapped to [65-90] (hex [0x41-0x5a]), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
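As a quick check of the mapping described above, here is a minimal Python sketch. The variable names are only illustrative; `ord` and `str.encode` are standard built-ins.

```python
# Code points of the first and last letters of each range.
upper = [ord(c) for c in "AZ"]
lower = [ord(c) for c in "az"]
print(upper, lower)  # [65, 90] [97, 122]

# Text limited to code points 0-127 produces the same bytes
# whether it is encoded as ASCII or as UTF-8.
text = "Hello 123"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)  # True
```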

Does the English alphabet appear elsewhere in the code charts? I do not mean letters with a circumflex or other Latin variants, just the plain English alphabet.

Michael Madsen · Accepted Answer

CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.

When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
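The width difference can be inspected through the East Asian Width property that Unicode assigns to each character. The following is a small sketch using Python's standard `unicodedata` module; the sample characters are chosen only for illustration.

```python
import unicodedata

# East Asian Width classes: "Na" = narrow, "W" = wide, "F" = fullwidth.
ascii_width = unicodedata.east_asian_width("A")           # plain ASCII letter
cjk_width = unicodedata.east_asian_width("\u4e2d")        # a CJK ideograph
fullwidth_width = unicodedata.east_asian_width("\uff21")  # fullwidth Latin A

print(ascii_width, cjk_width, fullwidth_width)  # Na W F
```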

For this purpose, ｆｕｌｌｗｉｄｔｈ　ｃｈａｒａｃｔｅｒｓ (the Halfwidth and Fullwidth Forms block, U+FF00-FFEF; see Wikipedia and the Unicode code chart) may be used in place of "regular" characters. Each of them has the same width as a single CJK character.
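For a talk demo, plain English text can be turned into its fullwidth counterpart with a fixed offset: the fullwidth forms of the printable ASCII characters U+0021-007E sit at U+FF01-FF5E. The sketch below is one way to do it; `to_fullwidth` and `FULLWIDTH_OFFSET` are hypothetical names, and mapping the space to U+3000 (ideographic space, which lies outside the fullwidth block) is an assumption made for the demo.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 minus U+0021

def to_fullwidth(text: str) -> str:
    """Map printable ASCII to fullwidth forms; leave other characters alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        elif ch == " ":
            out.append("\u3000")  # ideographic space (assumed choice)
        else:
            out.append(ch)
    return "".join(out)

wide = to_fullwidth("Hello 123")
print(wide)
print(hex(ord(wide[0])))                 # 0xff28, far above 127
print(len(wide), len(wide.encode("utf-8")))  # 9 characters, 27 bytes

# NFKC normalization folds the fullwidth forms back to ASCII.
print(unicodedata.normalize("NFKC", wide) == "Hello 123")  # True
```

The same English letters and digits now occupy three bytes each in UTF-8, so any code that assumes one byte per character will miscount this string.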

Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.

Tags:

character-encoding

unicode

Adam Matan
