
What is the correct stored value for umlaut "ü" in Perl?

I would like to serve UTF-8 web pages directly from Perl. I ran into several encoding issues because the source data is not stored entirely in UTF-8. While debugging those issues, I found two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

  • \xFC, which is the Unicode position U+00FC for ü
  • 0xC3 0xBC, which is the UTF-8 hex representation for ü

If there is no difference, why does Perl store umlauts in different representations instead of settling on either the Unicode code point or the UTF-8 byte sequence?

Unicode/UTF-8 character table reference

burnersk asked Dec 01 '25 03:12



2 Answers

Use fix_latin from Encoding::FixLatin to repair input that mixes Latin-1 and UTF-8.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

Internally, it's best to work with Unicode. Decode inputs, encode outputs. You likely ended up with the mix by forgetting to encode an output.
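The "decode inputs, encode outputs" rule can be sketched with the core Encode module. This is only a minimal illustration: the variable names are made up, and the input bytes are hard-coded here on the assumption that they arrived from a UTF-8 source such as a file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the UTF-8 bytes for "ü", as they might arrive from outside.
my $input_bytes = "\xC3\xBC";

# Decode on input: two bytes become one Unicode character (U+00FC).
my $text = decode('UTF-8', $input_bytes);
printf "decoded: %d character(s), U+%04X\n", length($text), ord($text);

# ... work with $text as characters here ...

# Encode on output: the character becomes the bytes C3 BC again.
my $output_bytes = encode('UTF-8', $text);
printf "encoded: %d byte(s), %v02X\n", length($output_bytes), $output_bytes;
```

Inside the program the string holds the single value \xFC; only at the boundaries does it exist as 0xC3 0xBC.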

ikegami answered Dec 03 '25 07:12



There is no single "correct" value; they are different representations of the same character. Generally, it is probably better to settle on Unicode internally and print it out as UTF-8, but the main complication is knowing exactly what you have at each step of processing. If you can use UTF-8 reliably throughout, that may be simpler in your case.
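One way to check what you have at a given step is to dump the values Perl holds for the string. The sketch below uses a hypothetical helper, describe, which is not part of any module; it only reports the length and the per-character values, and it cannot tell you by itself whether a string is meant as decoded text or as encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: report the length and the values stored in a string.
sub describe {
    my ($label, $string) = @_;
    return sprintf '%s: length %d, values %vX', $label, length($string), $string;
}

my $decoded = "\xFC";                     # one character, U+00FC
my $encoded = encode('UTF-8', $decoded);  # the same character as UTF-8 bytes

print describe('decoded text', $decoded), "\n";
print describe('UTF-8 bytes', $encoded), "\n";
```

A length of 1 with the value FC points to decoded text, while a length of 2 with the values C3.BC points to UTF-8 bytes that have not been decoded yet.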

tripleee answered Dec 03 '25 05:12



