What is the correct stored value for umlaut "ü" in Perl?

Question

I would like to serve UTF-8 websites directly from Perl. I ran into several encoding issues because the source data is not stored entirely in UTF-8. While debugging these issues I discovered two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

\xFC, which is the Unicode code point U+00FC for ü
0xC3 0xBC, which is the UTF-8 byte sequence for ü

If there is no difference, why does Perl store umlauts in different representations instead of settling on either the Unicode code point or the UTF-8 byte sequence?

Unicode/UTF-8 character table reference

ikegami · Accepted Answer

Use Encoding::FixLatin's fix_latin.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

Internally, it's best to work with Unicode. Decode inputs, encode outputs. You likely ended up with the mix by forgetting to encode an output.
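
As a rough illustration of "decode inputs, encode outputs", here is a minimal sketch that assumes the incoming bytes are valid UTF-8. The variable names are made up for the example and are not from the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: the UTF-8 form of "ü".
my $input_bytes = "\xC3\xBC";

# Decode at the input boundary: the result is a string of code points.
my $text = decode('UTF-8', $input_bytes);

# Inside the program, "ü" is now the single character U+00FC.
my $chars     = length $text;
my $codepoint = sprintf '%04X', ord $text;

# Encode at the output boundary: back to UTF-8 bytes.
my $output_bytes = encode('UTF-8', $text);
my $hex          = sprintf '%v02X', $output_bytes;

print "chars=$chars codepoint=U+$codepoint bytes=$hex\n";
```

Under these assumptions, both representations from the question show up: \xFC is what the program works with internally, and C3 BC is what gets written out.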

tripleee · Answer

There is no single "correct" value; they are different representations of the same character. Generally speaking, it is probably better to settle on Unicode internally and print it out as UTF-8, but the main complication is knowing exactly what you have at each step of processing. If you can use UTF-8 reliably throughout, that may be simpler in your case.
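
One way to check what a string holds at a given step is to dump its elements as hex. The sketch below uses a hypothetical helper, dump_string, which is not part of any module. It relies only on sprintf's %v flag, the same flag used in the one-liner in the accepted answer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each element of a string as hex, joined by dots.
sub dump_string {
    my ($s) = @_;
    return sprintf '%vX', $s;
}

my $decoded = "\xFC";                     # one character, U+00FC
my $encoded = encode('UTF-8', $decoded);  # two bytes, C3 BC

my $dump_decoded = dump_string($decoded);
my $dump_encoded = dump_string($encoded);

print "decoded: $dump_decoded\n";
print "encoded: $dump_encoded\n";
```

A dump showing FC suggests a decoded string, while C3.BC suggests UTF-8 bytes that still need decoding before further text processing.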

Tags:

unicode

utf-8

perl

diacritics

burnersk
