I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to encode each string as UTF-8 and compare the result with the original:
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;
my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";
my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);
print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;
if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }
which outputs
$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
3
The idea was that only a latin1 string would grow when encoded, but encoding a string that is already UTF-8 also makes it longer. So I can't tell latin1 from UTF-8 that way.
Question
I would like to always end up with a UTF-8 string, but how can I detect whether a string is latin1 or UTF-8, so that I convert only the latin1 strings?
A yes/no answer to whether a string is UTF-8 would be just as useful.
Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings[1].
As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.
utf8:: implementation:
my $decoded_text = $utf8_or_latin1;
utf8::decode($decoded_text);
Encode:: implementation:
use Encode qw( decode_utf8 );
my $decoded_text =
   eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
   // $utf8_or_latin1;
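For the yes/no part of the question, the return value of utf8::decode can serve as the test, since it is false when the bytes are not well-formed UTF-8. The following is only a sketch: the helper name looks_like_utf8 is made up for illustration, and it assumes the input is a plain byte string.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example):
# returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;    # work on a copy so the caller's string is untouched
    return utf8::decode($bytes) ? 1 : 0;
}

my $utf8_result   = looks_like_utf8("m\xc3\xa6gtig");   # UTF-8 bytes for "mægtig"
my $latin1_result = looks_like_utf8("m\xe6gtig");       # iso-8859-1 bytes for "mægtig"

print "UTF-8 input:  ", ($utf8_result   ? "yes" : "no"), "\n";
print "latin1 input: ", ($latin1_result ? "yes" : "no"), "\n";
```

Note that pure ASCII input also counts as "yes" here, which is harmless because ASCII decodes identically under both encodings.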
Now, you say you want UTF-8. UTF-8 is obtained by encoding decoded text.
utf8:: implementation:
my $utf8 = $decoded_text;
utf8::encode($utf8);
Encode:: implementation:
use Encode qw( encode_utf8 );
my $utf8 = encode_utf8($decoded_text);
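Putting the two steps together, a minimal sketch of a normalizer might look like the following. The sub name to_utf8 is an assumption made for this example, and it relies on utf8::decode leaving the string unchanged when it fails, which is what makes the iso-8859-1 fallback a no-op.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): takes bytes that are
# either UTF-8 or iso-8859-1 and returns UTF-8 encoded bytes.
sub to_utf8 {
    my ($bytes) = @_;
    utf8::decode($bytes);    # on failure the string is untouched, i.e. read as iso-8859-1
    utf8::encode($bytes);    # encode the decoded text back to UTF-8
    return $bytes;
}

# The two strings from the question.
my $from_latin1 = to_utf8("m\x{e6}gtig");
my $from_utf8   = to_utf8("m\x{c3}\x{a6}gtig");

print "same\n" if $from_latin1 eq $from_utf8;
```

Both inputs from the question end up as the same six-byte UTF-8 string, so the comparison prints "same".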
Notes
Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:
(<80>..<9F> are unassigned or unprintable control characters, not sure which.)
In other words, that code is very reliable.
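To see what a wrong guess would look like, here is a small sketch with a contrived input: the iso-8859-1 bytes for the two characters "Ã¦" happen to be well-formed UTF-8 as well, so the approach above would read them as the single character "æ". The variable names are illustrative only.

```perl
use strict;
use warnings;

# iso-8859-1 bytes for the two characters U+00C3 and U+00A6 ("Ã¦").
my $latin1_bytes = "\xc3\xa6";

my $guess = $latin1_bytes;
my $ok    = utf8::decode($guess);    # succeeds, because C3 A6 is also valid UTF-8

printf "decoded as UTF-8: %s\n", $ok ? "yes" : "no";   # yes
printf "characters: %d\n", length $guess;               # 1
printf "code point: U+%04X\n", ord $guess;              # U+00E6
```

Real iso-8859-1 text rarely contains such sequences, which is why the guess is usually right.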