How can I work with XML::XPath when some elements' names are not in English?
I use Strawberry Perl. I got employees.xml and train_xml.pl from the web, and they work fine. But when I add some Chinese characters, I get the following error:
Wide character in die at D:/Strawberry/perl/site/lib/XML/XPath/Parser.pm line 189.
Query:
/employees/employee[@age="30"]/工作...
..............................^^^
Invalid query somewhere around here (I think)
How can I solve this?
employees.xml:
<?xml version="1.0" encoding="utf-8" ?>
<employees>
<employee age="30">
<name>linux</name>
<country>US</country>
<工作>教师</工作>
</employee>
<employee age="10">
<name>mac</name>
<country>US</country>
</employee>
<employee age="20">
<name>windows</name>
<country>US</country>
</employee>
</employees>
train_xml.pl:
use Encode;
use XML::XPath->new;
use utf8;
my $xp=XML::XPath->new(filename=>"employees.xml");
print $xp->findvalue('/employees/employee[@age="10"]/name'),"\n";
my $path1 = '/employees/employee[@age="30"]/工作';
print $xp->findvalue($path1),"\n";
You could use XML::LibXML:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
use feature qw( say );
use XML::LibXML qw( );
{
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($ARGV[0]);
say $doc->findvalue('/employees/employee[@age="10"]/name');
say $doc->findvalue('/employees/employee[@age="30"]/工作');
}
Output:
$ ./a a.xml
mac
教师
If you want to keep using the (buggy, slower, and far less widely used) XML::XPath, you can use the following:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
use feature qw( say );
use XML::XPath qw( );
{ # Monkeypatch XML::XPath.
package XML::XPath::Parser;
# Colon removed from these definitions.
my $NameStartCharClassBody = "a-zA-Z_\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}";
my $NameCharClassBody = "${NameStartCharClassBody}\\-.0-9\\xB7\\x{300}-\\x{36F}\\x{203F}-\\x{2040}";
my $Name = "(?:[$NameStartCharClassBody][$NameCharClassBody]*)";
$NCName = $Name;
$QName = "$NCName(?::$NCName)?";
$NCWild = "${NCName}:\\*";
}
{
my $doc = XML::XPath->new(filename => $ARGV[0]);
say $doc->findvalue('/employees/employee[@age="10"]/name');
say $doc->findvalue('/employees/employee[@age="30"]/工作');
}
Output:
$ ./a a.xml
mac
教师
You should always, without exception, post the actual code you run, not gibberish like:
use XML::XPath->new;
Now, as for this issue, I am fairly certain it is caused by this line in XML/XPath/Parser.pm:
$NCName = '([A-Za-z_][\w\\.\\-]*)';
which, for reasons with which I am not familiar, restricts the first character of an element name to ASCII letters and _. Here is a simple test case:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));
use XML::XPath;
my $xp = XML::XPath->new(ioref => \*DATA );
my $good_path = '/employees/employee[@age="30"]/yağcı';
my $bad_path = '/employees/employee[@age="30"]/şımarık';
say $xp->findvalue($good_path);
say $xp->findvalue($bad_path);
__DATA__
<?xml version="1.0" encoding="utf-8" ?>
<employees>
<employee age="30">
<şımarık>değil</şımarık>
<yağcı>değil</yağcı>
</employee>
</employees>
Output:
C:\...\> perl x.pl
değil
Query:
/employees/employee[@age="30"]/şımarık...
..............................^^^
Invalid query somewhere around here (I think)
If I change that pattern to:
$NCName = '(\w[\w\\.\\-]*)';
I get the output:
C:\...\> perl x.pl
değil
değil
and, using your original data, I get:
değil
教师
after making the appropriate changes.
That is not the correct pattern to use, but I did this to make sure that my hunch as to the cause was correct by making the smallest change possible. The correct specification is in the standard:
Name ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] |
[#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
[#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
[#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
[#x0300-#x036F] | [#x203F-#x2040]
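To see the difference without touching the module, here is a minimal sketch that compares the pattern quoted earlier from Parser.pm with a character class built from the productions above. It assumes the colon is dropped from NameStartChar (as for an NCName), and `is_name` is a hypothetical helper written for this illustration, not part of XML::XPath:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

# Pattern quoted above from XML/XPath/Parser.pm (before the fix).
my $old_ncname = '([A-Za-z_][\w\\.\\-]*)';

# NameStartChar and NameChar from the productions above, with ":" removed
# (assumption: we want an NCName, so the colon is not allowed).
my $start = 'A-Za-z_\xC0-\xD6\xD8-\xF6\xF8-\x{2FF}\x{370}-\x{37D}'
          . '\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}'
          . '\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}'
          . '\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}';
my $rest  = $start . '\-.0-9\xB7\x{300}-\x{36F}\x{203F}-\x{2040}';
my $new_ncname = "([$start][$rest]*)";

# Hypothetical helper: returns 1 if the whole string matches the pattern.
sub is_name {
    my ($pattern, $name) = @_;
    return $name =~ /\A$pattern\z/ ? 1 : 0;
}

# Report, for each sample name, whether the old and new patterns accept it.
for my $name ('name', 'yağcı', 'şımarık', '工作') {
    printf "%s: old=%d new=%d\n", $name,
        is_name($old_ncname, $name), is_name($new_ncname, $name);
}
```

The old pattern should accept only names that begin with an ASCII letter or underscore, which is why yağcı passes and şımarık and 工作 do not, while the class derived from the standard accepts all four.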
Issue opened.
The module has been patched. You can download version 1.41 or later to test.