I would like to deliver UTF-8 websites directly with Perl. I ran into several encoding issues because the source data is not stored entirely in UTF-8. While debugging these issues, I discovered two different representations of the German umlaut ü. Which one is the correct stored value in Perl?
- \xFC, which is the Unicode position U+00FC for ü
- 0xC3 0xBC, which is the UTF-8 hex representation for ü

If there isn't any difference, then why does Perl store umlauts in different representations instead of storing them consistently as either the Unicode position or the UTF-8 hex representation?
Unicode/UTF-8 character table reference
Use Encoding::FixLatin's fix_latin.
$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
-E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC
Internally, it's best to work with Unicode. Decode inputs, encode outputs. You likely ended up with the mix by forgetting to encode an output.
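The "decode inputs, encode outputs" advice can be sketched with the core Encode module. This is a minimal illustration, not a drop-in fix: the two input strings are hypothetical stand-ins for data arriving from a Latin-1 source and a UTF-8 source, and it assumes you know which encoding each input actually uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: the same letter as raw bytes from two different sources.
my $latin1_bytes = "\xFC";       # ü as a single ISO-8859-1 byte
my $utf8_bytes   = "\xC3\xBC";   # ü as two UTF-8 bytes

# Decode each input with the encoding it really uses.
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

# Internally both are now the same one-character string, U+00FC.
printf "%d %d U+%04X\n", length $from_latin1, length $from_utf8, ord $from_utf8;

# Encode exactly once, at the point of output.
my $out = encode('UTF-8', $from_latin1 . $from_utf8);
printf "%vX\n", $out;
```

This prints `1 1 U+00FC` followed by `C3.BC.C3.BC`. For output to a filehandle, an encoding layer such as `binmode(STDOUT, ':encoding(UTF-8)')` can do the final encoding step instead of an explicit `encode` call.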
There is no single "correct" value; they are different representations of the same character. Generally speaking, it is probably better to settle on Unicode internally and print it out as UTF-8, but the main complication is knowing exactly what you have at each step of processing; if you can use UTF-8 reliably throughout, that may be simpler in your case.
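To see what you have at a given step, it can help to dump the string's ordinals. The following is a small sketch under the assumption that `$char` holds a decoded character string; the variable names are illustrative only. It also shows how encoding an already encoded string produces the kind of garbage that often turns up during such debugging.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{FC}";                # one character: U+00FC
my $bytes = encode('UTF-8', $char);  # two bytes: C3 BC

printf "chars: %d, bytes: %d\n", length $char, length $bytes;
printf "%vX vs %vX\n", $char, $bytes;

# Encoding a second time treats each byte as a Latin-1 character
# and encodes it again, which yields mojibake.
my $double = encode('UTF-8', $bytes);
printf "%vX\n", $double;
```

This prints `chars: 1, bytes: 2`, then `FC vs C3.BC`, then `C3.83.C2.BC`. The last line is the double-encoded form, which a browser would display as "Ã¼" instead of "ü".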