
Can there be 2 different UTF-8 encodings for the same character?

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).

Everything works fine, except that I sometimes get strange encodings for some umlaut characters. For example, the Latin-1 character ë (e with diaeresis, 0xEB) usually arrives as the UTF-8 bytes 0xC3 0xAB, but sometimes as 0xC3 0x83 0xC2 0xAB.

This has happened a number of times with input from different sources. Since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?

Gene Vincent asked May 18 '12 11:05



2 Answers

Certain Unicode characters can be represented in both a composed and a decomposed form. For example, the German u-umlaut ü can be represented either by the single precomposed character ü (U+00FC) or by a plain u (U+0075) followed by the combining diaeresis ¨ (U+0308), which a text renderer then combines into one glyph.

See the Wikipedia article on Unicode equivalence for gory details.

Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
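
As a rough illustration of such normalization, here is a minimal sketch in Python using the standard `unicodedata` module. The variable names are made up for this example, and Python is an assumption on my part, since the question does not say which language or library is in use.

```python
import unicodedata

composed = "\u00fc"     # ü as a single precomposed code point
decomposed = "u\u0308"  # plain u followed by COMBINING DIAERESIS

# Both strings render the same, but they are different code point
# sequences, so their UTF-8 bytes differ too.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xbc'
decomposed_bytes = decomposed.encode("utf-8")  # b'u\xcc\x88'

# Normalizing both sides to the same form makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Only the composed form maps to a single Latin-1 byte, so normalize
# to NFC before transcoding to ISO-8859-1.
latin1 = unicodedata.normalize("NFC", decomposed).encode("latin-1")  # b'\xfc'
```

By the same rule, a decomposed ë would arrive as the bytes 0x65 0xCC 0x88 (e plus the combining diaeresis).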

DarkDust answered Sep 18 '22 13:09


$ "\xC3\x83\xC2\xAB"
Ã«
$ use Encode

$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë

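You have double-encoded UTF-8: text that was already UTF-8 (0xC3 0xAB) was misread as Latin-1 and encoded to UTF-8 a second time, so 0xC3 became 0xC3 0x83 and 0xAB became 0xC2 0xAB. The Perl module Encode::Repair is one way to deal with that.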
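
For readers not using Perl, the same idea can be sketched in Python. This is only a heuristic under stated assumptions, not what Encode::Repair does internally: `repair_double_utf8` is a hypothetical helper name, and it assumes the accidental first decode was Latin-1 (not Windows-1252) and that only one extra layer of encoding was applied.

```python
def repair_double_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, undoing one layer of double encoding if present.

    Hypothetical helper for illustration only.
    """
    text = data.decode("utf-8")
    try:
        # If every character fits in Latin-1 and the resulting bytes are
        # themselves valid UTF-8, assume the input was encoded twice.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Otherwise the text was encoded only once; keep it as decoded.
        return text


# How the double encoding arises: UTF-8 bytes misread as Latin-1,
# then encoded to UTF-8 again.
once = "\u00eb".encode("utf-8")                 # b'\xc3\xab'
twice = once.decode("latin-1").encode("utf-8")  # b'\xc3\x83\xc2\xab'

fixed = repair_double_utf8(twice)  # 'ë' (U+00EB)
latin1 = fixed.encode("latin-1")   # b'\xeb'
```

Correctly encoded input such as `once` passes through unchanged, because the single Latin-1 byte 0xEB is not valid UTF-8 on its own. The heuristic can still misfire on text that legitimately contains a sequence like "Ã«", so it is safer to fix the source that produces the double encoding where possible.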

daxim answered Sep 17 '22 13:09