Correcting the XML encoding

I have an XML file whose encoding declaration says 'utf-8', but the content is actually ISO-8859-1.

Programmatically, how do I detect this in Perl and Python? And how do I decode it with a different encoding?

In Perl, I tried

$xml = decode('iso-8859-1',$file)

but this does not work.

vkris asked Nov 19 '25 22:11


1 Answer

Miscoding is notoriously tricky to detect, because arbitrary binary data often forms a valid string in many encodings.

In Perl, the easiest thing to try is to decode the data as UTF-8 and check for failure. (This only works one way round: a UTF-8 encoded Western-language document is almost always a valid ISO-8859-1 document as well, whereas ISO-8859-1 text containing accented characters is rarely valid UTF-8.)

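use Encode qw(decode FB_CROAK LEAVE_SRC);
# Strict UTF-8 decode; LEAVE_SRC keeps Encode from modifying $file in place.
my $xml = eval { decode( 'UTF-8', $file, FB_CROAK | LEAVE_SRC ) };
$xml = decode( 'iso-8859-1', $file ) unless defined $xml;   # probably ISO-8859-1 instead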
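
The question also asks about Python. A minimal sketch of the same check there, assuming the file has been read as raw bytes; the function name decode_xml_bytes is made up for illustration and is not part of any library:

```python
from typing import Tuple


def decode_xml_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for raw XML bytes (hypothetical helper)."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# "caf\u00e9" is "cafe" with an acute accent on the e, stored here as ISO-8859-1 bytes.
sample = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("iso-8859-1")
text, used = decode_xml_bytes(sample)
print(used)  # iso-8859-1
```

As with the Perl version, this is a heuristic: a successful UTF-8 decode is strong evidence, but falling back to ISO-8859-1 is only a guess, since any byte sequence decodes under it.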

Now that you've detected the problem, you've got to work around it. How you do that will most likely depend on the parser library you're using, but some general points ought to apply.

If there's no XML declaration or MIME type, the Perl native encoding will be used, so the decode call you posted should do the trick.

If there's a mistaken XML declaration, you could either override it using any facility your XML decoding library provides, or just replace it manually before handing it over.

# assuming the declaration is at the very start of the document:
$contents =~ s/\A<\?xml[^>]*\?>/<?xml version="1.0" encoding="ISO-8859-1"?>/;
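
For the Python side, the same manual replacement might look like the sketch below. It works on raw bytes and hands the result to the standard library's xml.etree.ElementTree; force_declared_encoding and XML_DECL are hypothetical names, and the pattern assumes the declaration, if present, sits at the very start of the data.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of the document (bytes pattern).
XML_DECL = re.compile(rb"\A<\?xml[^>]*\?>")


def force_declared_encoding(raw: bytes, encoding: str = "ISO-8859-1") -> bytes:
    """Replace (or prepend) the XML declaration so it names `encoding`."""
    decl = b'<?xml version="1.0" encoding="' + encoding.encode("ascii") + b'"?>'
    if XML_DECL.match(raw):
        # A function replacement avoids backslash/group interpretation in `decl`.
        return XML_DECL.sub(lambda m: decl, raw, count=1)
    return decl + raw


# Mislabelled input: declared utf-8, actually ISO-8859-1 bytes.
raw = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("iso-8859-1")
root = ET.fromstring(force_declared_encoding(raw))
print(root.text == "caf\u00e9")  # True
```

Other parsers may offer a way to override the declared encoding directly, in which case rewriting the declaration by hand is unnecessary.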

JB. answered Nov 21 '25 14:11


