Can there be 2 different UTF-8 encodings for the same character?

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).

Everything works, except that I sometimes get strange encodings for some umlaut characters. For example, the Latin-1 e with diaeresis, ë (0xEB), usually arrives as UTF-8 0xC3 0xAB, but sometimes as 0xC3 0x83 0xC2 0xAB.

This has happened a number of times with input from different sources. Since the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?

asked May 18 '12 11:05

Gene Vincent

2 Answers

Certain Unicode characters can be represented in both a composed and a decomposed form. For example, the German u-umlaut ü can be represented either by the single code point ü (U+00FC) or by u followed by the combining diaeresis (U+0308), which a text renderer then combines into one glyph.

See the Wikipedia article on Unicode equivalence for gory details.

Unicode libraries thus usually provide methods or functions to normalize strings into one form or another so you can compare them.
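As a rough illustration of the point above, the following Perl sketch shows the composed and decomposed forms of ë producing different UTF-8 bytes, and normalization to NFC making them comparable and convertible to Latin-1. It assumes the core modules Encode and Unicode::Normalize; the helper name `utf8_hex` is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Composed form: a single code point, U+00EB (e with diaeresis).
my $composed   = "\x{EB}";
# Decomposed form: "e" followed by U+0308 (combining diaeresis).
my $decomposed = "e\x{308}";

# Hypothetical helper: hex dump of a character string's UTF-8 bytes.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $chars);
}

print utf8_hex($composed),   "\n";   # C3 AB
print utf8_hex($decomposed), "\n";   # 65 CC 88

# After normalization both spellings compare equal.
print "equal after NFC\n" if NFC($decomposed) eq NFC($composed);

# Latin-1 has no combining diaeresis, so normalize to NFC before transcoding.
my $latin1 = encode('ISO-8859-1', NFC($decomposed));
printf "%02X\n", ord $latin1;        # EB
```

Note that neither form yields the bytes 0xC3 0x83 0xC2 0xAB from the question, so normalization alone would not explain that input.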

answered Sep 18 '22 13:09

DarkDust

$ "\xC3\x83\xC2\xAB"
Ã«
$ use Encode

$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë

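You have double-encoded UTF-8: the correct bytes 0xC3 0xAB were at some point misread as the two Latin-1 characters Ã (0xC3) and « (0xAB), and each of those was then encoded to UTF-8 again, giving 0xC3 0x83 and 0xC2 0xAB. Encode::Repair is one way to deal with that.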
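If pulling in another module is not an option, one round of double encoding can also be undone by hand with plain Encode. The sketch below is a heuristic, not what Encode::Repair does internally: the function name `fix_double_utf8` is hypothetical, and it assumes the misreading was as ISO-8859-1 (not Windows-1252) and happened only once.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns a character string,
# undoing one round of "UTF-8 misread as Latin-1 and re-encoded as UTF-8".
sub fix_double_utf8 {
    my ($bytes) = @_;
    my $once = decode('UTF-8', $bytes);

    # Characters above U+00FF cannot come from a Latin-1 misreading.
    return $once if $once =~ /[^\x00-\xFF]/;

    # Map each character back to the byte it was misread from,
    # then check whether those bytes are themselves valid UTF-8.
    my $inner = encode('ISO-8859-1', $once);
    my $twice = eval { decode('UTF-8', $inner, Encode::FB_CROAK) };

    # If the second decode fails, the input was not double-encoded.
    return defined $twice ? $twice : $once;
}

my $fixed  = fix_double_utf8("\xC3\x83\xC2\xAB");  # double-encoded input
my $plain  = fix_double_utf8("\xC3\xAB");          # correctly encoded input
my $latin1 = encode('ISO-8859-1', $fixed);
printf "%02X\n", ord $latin1;                      # EB
```

The heuristic has a known blind spot: text that genuinely contains a sequence such as Ã« would also be "repaired", so it is best applied only to sources known to misbehave.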

answered Sep 17 '22 13:09

daxim

Tags: character-encoding, utf-8, perl
