I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).
All works fine, except that I sometimes get strange encodings for some umlaut characters. For example, the Latin-1 e with diaeresis, ë (0xEB), usually arrives as UTF-8 0xC3 0xAB, but sometimes as 0xC3 0x83 0xC2 0xAB.
This has happened a number of times with input from different sources, and since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?
UTF-8 encodes each character as one to four bytes. The first 128 Unicode code points, which are identical to ASCII (CCSID 367), are encoded as a single byte; every other character takes more than one byte. Altogether, UTF-8 can encode all 1,112,064 valid Unicode code points using one to four one-byte (8-bit) code units. Note that UTF-8 is the valid IANA character set name, whereas utf8 is not, and is not even a registered alias. The most common Unicode encoding schemes are UTF-8, UTF-16, and UTF-32.
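As a quick sanity check, here is a minimal Perl sketch using the core Encode module to show the expected byte layout of ë; the %vX format just prints each byte's ordinal in hex, separated by periods:

use Encode qw(encode);

my $bytes = encode('UTF-8', "\x{EB}");   # U+00EB, LATIN SMALL LETTER E WITH DIAERESIS
printf "%vX\n", $bytes;                  # prints: C3.AB

So a single, correct encoding of ë is always the two bytes 0xC3 0xAB; four bytes for this character means something else happened.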
Certain Unicode characters can be represented in a composed and a decomposed form. For example, the German umlaut-u, ü, can be represented either by the single character ü (U+00FC) or by u followed by a combining diaeresis (U+0308), which a text renderer then combines. See the Wikipedia article on Unicode equivalence for the gory details.
Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
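In Perl, for instance, the core Unicode::Normalize module provides this; a minimal sketch comparing the two forms of ü:

use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{FC}";      # ü as a single precomposed code point
my $decomposed = "u\x{0308}";   # u followed by COMBINING DIAERESIS

print $composed eq $decomposed      ? "same\n" : "different\n";  # different
print NFC($decomposed) eq $composed ? "same\n" : "different\n";  # same
print NFD($composed) eq $decomposed ? "same\n" : "different\n";  # same

Normalizing both sides to NFC (or NFD) before comparing is the usual way to treat the two representations as equal.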
$ "\xC3\x83\xC2\xAB"
ë
$ use Encode
$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë
You have double-encoded UTF-8: somewhere along the way, the UTF-8 bytes 0xC3 0xAB were mistaken for Latin-1 text and encoded to UTF-8 a second time (0xC3 becomes 0xC3 0x83, and 0xAB becomes 0xC2 0xAB). Encode::Repair is one way to deal with that.
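If pulling in Encode::Repair is not an option, the repair can be sketched with the core Encode module alone: decode the outer UTF-8 layer, write the resulting characters back out as Latin-1 bytes, and decode once more. A minimal sketch, using the byte string from the question:

use Encode qw(decode encode);

my $mangled = "\xC3\x83\xC2\xAB";        # ë, encoded to UTF-8 twice

# Peel off the outer layer: decode as UTF-8, then serialize the
# resulting characters back to their Latin-1 byte values.
my $once = encode('ISO-8859-1', decode('UTF-8', $mangled));   # now 0xC3 0xAB

# A final decode yields the intended character.
my $fixed = decode('UTF-8', $once);
printf "U+%04X\n", ord $fixed;           # prints: U+00EB

This only works when every character of the double-encoded text survived the mistaken Latin-1 round trip, which is the case for the byte sequences described in the question.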