
Character Set Special Characters

  • Is iso-8859-1 a proper subset of utf-8?
  • What about iso-8859-n?
  • What about windows-1252?

If the answer is no to any of the above, what are the disjoint characters? I'm testing some logic that detects charsets and want to write tests to verify the detection is working properly.

Sean Jezewski asked Apr 05 '12 01:04




1 Answer

Is iso-8859-1 a proper subset of utf-8?

The character repertoire of ISO-8859-1 (the first 256 characters of Unicode) is a proper subset of that of UTF-8 (every Unicode character).

However, the two are not byte-compatible beyond ASCII (00 to 7F): the characters U+0080 to U+00FF are encoded differently in the two encodings.

  • ISO-8859-1 assigns each of these characters a single byte from 80 to FF.
  • UTF-8 encodes the same characters as two-byte sequences C2 80 to C3 BF.
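
For test-writing purposes, the byte-level difference can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper name `is_valid_utf8` is made up for illustration and is not part of any library.

```python
# Show how the same characters serialize under each encoding.
for ch in ("A", "\u0080", "é", "ÿ"):
    latin1 = ch.encode("iso-8859-1")
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  iso-8859-1={latin1.hex()}  utf-8={utf8.hex()}")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café" in ISO-8859-1 ends with the lone byte E9, which is not valid UTF-8.
print(is_valid_utf8("café".encode("iso-8859-1")))
print(is_valid_utf8("café".encode("utf-8")))
```

A detector test could use inputs like the ISO-8859-1 bytes for "café" as a case that must not be classified as UTF-8. Note that some ISO-8859-1 byte strings happen to be valid UTF-8 as well (for example C3 A9), so such inputs are inherently ambiguous.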

What about iso-8859-n?

These are 15 different encodings that contain a total of 614 distinct characters. Some of these characters occur in multiple "parts" of ISO 8859, and some don't. You'll have to be more specific.

I see that your question is tagged ISO-8859-2. The characters that are in -2 that aren't in -1 are:

Ă㥹ĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŔŕŘřŚśŞşŠšŢţŤťŮůŰűŹźŻżŽžˇ˘˙˛˝
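
Rather than hard-coding such a list in tests, it can be regenerated from codec tables. This sketch assumes Python's bundled `iso-8859-1` and `iso-8859-2` codecs match the standards; `repertoire` is an illustrative helper name, not a library function.

```python
def repertoire(encoding: str) -> set:
    """Characters decoded from the single bytes A0..FF (the printable upper half)."""
    chars = set()
    for b in range(0xA0, 0x100):
        try:
            chars.add(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this encoding
    return chars


# Characters present in ISO-8859-2 but absent from ISO-8859-1, sorted by code point.
only_in_latin2 = sorted(repertoire("iso-8859-2") - repertoire("iso-8859-1"))
print(len(only_in_latin2), "".join(only_in_latin2))
```

Swapping in another part of ISO 8859 (for example `iso-8859-15`) gives the corresponding difference for that encoding.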

What about windows-1252?

Windows-1252 is just like ISO-8859-1 except that it replaces most of the rarely used control characters in the 0x80-0x9F range with printable characters (five positions, 0x81, 0x8D, 0x8F, 0x90 and 0x9D, are left unassigned). The characters that are in windows-1252 but not in ISO-8859-1 are:

ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™
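
The 0x80-0x9F range is where the two encodings diverge, so it is a natural source of test bytes. A rough sketch, assuming Python's `cp1252` codec (which rejects the unassigned positions; other decoders, such as browsers, may map them to control characters instead):

```python
printable = []   # characters windows-1252 assigns in 0x80..0x9F
unassigned = []  # byte values Python's cp1252 codec refuses to decode

for b in range(0x80, 0xA0):
    try:
        printable.append(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        unassigned.append(b)

print(len(printable), "".join(printable))
print([hex(b) for b in unassigned])

# The same byte means different things in the two encodings:
print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("iso-8859-1")))
```

Bytes outside this range decode identically in both encodings, so a detector cannot tell the two apart unless the input contains at least one byte from 0x80-0x9F.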

dan04 answered Sep 30 '22 07:09
