
Getting encoding error when using hash keys to write xml files with XML::LibXML

This question is related to: Hash keys encoding: Why do I get here with Devel::Peek::Dump two different results?
The script below works if I either uncomment the # utf8::upgrade( $name ); line or comment out the $hash{'müller'} = 'magenta'; line. As posted, it fails with an encoding error.

#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';
use XML::LibXML;

# Hash read in from a file:
# ... 
my %hash = ( 'müller' => 'green', 'schneider' => 'blue', 'bäcker' => 'red' );
# ...

# change or add something
$hash{'müller'} = 'magenta';

# writing Hash to xml file
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8' );
my $root = $doc->createElement( 'my_test' );

for my $name ( keys %hash ) {
    # utf8::upgrade( $name );
    my $tag = $doc->createElement( 'item' );
    $tag->setAttribute( 'name' => $name );
    my $tag_color = $doc->createElement( 'color' );
    $tag_color->appendTextNode( $hash{$name} );
    $tag->appendChild( $tag_color );
    $root->appendChild( $tag );
}
$doc->setDocumentElement($root);
say $doc->serialize( 1 );
$doc->toFile( 'my_test.xml', 1 );

Output:

error : string is not in UTF-8  
encoding error : output conversion failed due to conv error, bytes 0xFC 0x6C 0x6C 0x65  
I/O error : encoder error  
<?xml version="1.0" encoding="ISO-8859-1"?>  
<my_test>  
  <item name="m    
i18n error : output conversion failed due to conv error, bytes 0xFC 0x6C 0x6C 0x65
I/O error : encoder error
sid_com asked Dec 09 '11 10:12



2 Answers

According to XML::LibXML, whether 'müller' eq 'müller' is true or false depends on how the strings have been stored internally. That's a bug. Specifically, assigning meaning to the UTF8 flag is known as "The Unicode Bug", and XML::LibXML is documented to do exactly that in the "encodings support" section of this page.

The bug is known, but it can't be fixed cleanly for backwards compatibility reasons. Perl provides two tools to work around instances of The Unicode Bug:

utf8::upgrade( $sv );    # Switch to the UTF8=1 storage format
utf8::downgrade( $sv );  # Switch to the UTF8=0 storage format

The former is the appropriate tool to use here.

# Return an upgraded (UTF8=1) copy of the argument
sub _up { my ($s) = @_; utf8::upgrade($s); $s }
$tag->setAttribute( 'name' => _up $name );
$tag_color->appendTextNode( _up $hash{$name} );

Note: utf8::upgrade is available even without use utf8;. Only add use utf8; if your source code is encoded as UTF-8.
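
To make the two storage formats concrete, here is a minimal sketch using only core Perl. The variable names $down and $up are illustrative, and the string is written with a \xFC escape so the example does not depend on how the file itself is saved. It is meant to show that utf8::upgrade changes only the internal representation, not the string's value, and that the utf8:: functions work without use utf8;.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# 'müller' written with an escape, so the file's own encoding is irrelevant.
my $down = "m\xFCller";
utf8::downgrade($down);    # make sure this copy uses the UTF8=0 format

my $up = $down;            # independent copy of the same string
utf8::upgrade($up);        # switch the copy to the UTF8=1 format

# The flag differs, but length and string equality do not.
printf "down: flag=%d length=%d\n", ( utf8::is_utf8($down) ? 1 : 0 ), length $down;
printf "up:   flag=%d length=%d\n", ( utf8::is_utf8($up)   ? 1 : 0 ), length $up;
print  "equal as strings: ", ( $down eq $up ? "yes" : "no" ), "\n";
```

Perl itself treats both variables as the same six-character string; the difference only surfaces in code that inspects the flag, which is what the answer above describes for XML::LibXML.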

ikegami answered Sep 21 '22 16:09



I get the error if I save your script as iso-8859-1. If I save it as utf-8, it works.
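
One way to check which case applies is to test whether the bytes in question decode as strict UTF-8. The following is a minimal sketch using the core Encode module; is_valid_utf8 is a hypothetical helper name, and the two byte strings stand in for 'müller' as it would be stored in a UTF-8 file and in an ISO-8859-1 file. The second one contains the bytes 0xFC 0x6C 0x6C 0x65 quoted in the error message.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the byte string decodes as strict UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() with a CHECK value may modify its input
    return eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK() ); 1 } ? 1 : 0;
}

my $as_utf8   = "m\xC3\xBCller";    # 'müller' as UTF-8 bytes
my $as_latin1 = "m\xFCller";        # 'müller' as ISO-8859-1 bytes

printf "UTF-8 bytes:      %s\n", is_valid_utf8($as_utf8)   ? 'valid' : 'invalid';
printf "ISO-8859-1 bytes: %s\n", is_valid_utf8($as_latin1) ? 'valid' : 'invalid';
```

To apply the same check to the script itself, one could read the file in raw mode and pass its contents to the helper.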

choroba answered Sep 21 '22 16:09
