Perl LWP::UserAgent mishandling UTF-8 response

When I use LWP::UserAgent to retrieve content encoded in UTF-8, it seems LWP::UserAgent doesn't handle the encoding correctly.

Here's the output after switching the Command Prompt window to Unicode with the command chcp 65001. Note that this initially gives the appearance that all is well, but I think that is just the shell reassembling the bytes and decoding the UTF-8 itself. From the rest of the output you can see that Perl is not handling wide characters correctly.

C:\>perl getutf8.pl
======================================================================
HTTP/1.1 200 OK
Connection: close
Date: Fri, 31 Dec 2010 19:24:04 GMT
Accept-Ranges: bytes
Server: Apache/2.2.8 (Win32) PHP/5.2.6
Content-Length: 75
Content-Type: application/xml; charset=utf-8
Last-Modified: Fri, 31 Dec 2010 19:20:18 GMT
Client-Date: Fri, 31 Dec 2010 19:24:04 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1

<?xml version="1.0" encoding="UTF-8"?>
<name>Budějovický Budvar</name>

======================================================================
Response content length is 33

....v....1....v....2....v....3....v....4
<name>Budějovický Budvar</name>

. . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . .
3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e
< n a m e > B u d � � j o v i c k � �   B u d v a r < / n a m e >

Above you can see that the payload is 31 characters long, but Perl thinks it is 33. The hex dump confirms this: the two-byte UTF-8 sequences c49b (ě) and c3bd (ý) are each being treated as two separate characters rather than as one Unicode character apiece.
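The 33-versus-31 discrepancy can be reproduced without any HTTP request. The following is a minimal sketch, assuming only the core Encode module; the hex literal is copied from the dump above, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes exactly as shown in the hex dump above.
my $bytes = pack("H*",
    "3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e");

# On an undecoded string, length() counts bytes.
my $byte_len = length($bytes);

# Decode the UTF-8 bytes into a Perl character string.
my $chars    = decode("UTF-8", $bytes);
my $char_len = length($chars);          # now counts characters

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;

# Offset 9 is the first character after "<name>Bud".
printf "code point at offset 9: U+%04X\n", ord(substr($chars, 9, 1));
```

If the string returned by decoded_content reports the byte count rather than the character count, it has most likely not been charset-decoded.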

Here's the code:

#!perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $response = $ua->get('http://localhost/Bud.xml');
if (! $response->is_success) { die $response->status_line; }

print '='x70,"\n",$response->as_string(), '='x70,"\n";

my $r = $response->decoded_content(charset => 'UTF-8');
$/ = "\x0d\x0a"; # seems to be \x0a otherwise!
chomp($r);

# Remove any xml prologue
$r =~ s/^<\?.*\?>\x0d\x0a//;

print "Response content length is ", length($r), "\n\n";
print "....v....1....v....2....v....3....v....4\n";
print $r,"\n";

print ". . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . . \n";
print unpack("H*", $r), "\n";
print join(" ", split("", $r)), "\n";

Note that Bud.xml is UTF-8 encoded without a BOM.

How can I persuade LWP::UserAgent to do the right thing?

P.S. Ultimately I want to translate the Unicode data into an ASCII encoding, even if it means replacing each non-ASCII character with one question mark or other marker.
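For the ASCII fallback described in the P.S., one option is to let Encode do the substitution. This is a sketch, assuming the data has already been decoded into characters; with no CHECK argument, encode("ascii", ...) replaces each unmappable character with a single "?". The hex literal is the name portion of the dump above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for the name only, taken from the hex dump.
my $bytes = pack("H*", "427564c49b6a6f7669636bc3bd20427564766172");

# Bytes -> characters, then characters -> ASCII bytes.
my $chars = decode("UTF-8", $bytes);
my $ascii = encode("ascii", $chars);    # unmappable characters become "?"

print "$ascii\n";
```

The substitution is per character, not per byte, so this only gives one marker per accented letter if the input was decoded first.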

Update 1

I have accepted ysth's "upgrade" answer because I know it is the right thing to do when possible. However, there is a workaround that fixes up the data into a well-formed Perl Unicode string:

use Encode;   # exports decode()
$r = decode("utf8", $r);
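Once $r holds characters rather than bytes, printing it to a handle with no encoding layer produces a "Wide character" warning, because ě is above code point 255. Here is a sketch of encoding on output, assuming UTF-8 is the desired output encoding; an in-memory filehandle stands in for STDOUT so the result can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Character string, as it would be after the decode workaround.
my $bytes = pack("H*", "427564c49b6a6f7669636bc3bd20427564766172");
my $r     = decode("UTF-8", $bytes);

# Write through an encoding layer. For a real console the equivalent
# would be: binmode(STDOUT, ":encoding(UTF-8)");
my $captured = '';
open(my $out, '>:encoding(UTF-8)', \$captured) or die "open: $!";
print {$out} $r;
close($out);

printf "%d characters in, %d bytes out\n", length($r), length($captured);
```

The general pattern is to decode at the input boundary, work with characters in the program, and encode again at the output boundary.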

Update 2

My data gets fed to a non-Perl application that displays it using Code Page 437 on PuTTY/Reflection/Tera Term terminals at many locations. The app is currently displaying something like:

Bud├ä┬øjovick├â┬¢ Budvar

I am going to use ($r = decode("UTF-8", $r)) =~ s/[\x80-\x{FFFF}]/\xFE/g; to get the app to display:

Bud■jovick■ Budvar
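A runnable sketch of that substitution follows. It is a small variation on the expression above: the character class [^\x00-\x7F] is an assumption that characters beyond U+FFFF should be replaced as well, and the final encode to ISO-8859-1 is just one way to get a plain byte string in which the marker is the single byte 0xFE.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for the name only, taken from the hex dump.
my $bytes = pack("H*", "427564c49b6a6f7669636bc3bd20427564766172");

# Decode, then replace every non-ASCII character with \xFE
# (the square block in Code Page 437).
(my $r = decode("UTF-8", $bytes)) =~ s/[^\x00-\x7F]/\xFE/g;

# Every remaining character is below 0x100, so this yields one byte each.
my $out = encode("iso-8859-1", $r);

print unpack("H*", $out), "\n";
```

Decoding before substituting is what keeps it to one marker per accented letter; running the regex on the raw bytes would give two markers for each two-byte sequence.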

Moving away from CP437 would be a major job, so that is not going to happen in the short to medium term.

Update 3

CPAN has some interesting Unicode modules such as:

Text::Unidecode
Unicode::Map8
Unicode::Map
Unicode::Escape
Unicode::Transliterate

Text::Unidecode translated "Budějovický Budvar" into "Budejovicky Budvar", which didn't seem to me a particularly impressive attempt at a phonetic transliteration, but then I don't speak Czech. English speakers might prefer it to "Bud■jovick■ Budvar" though.
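For accented Latin letters like these, a similar result can be had with core modules only. The sketch below is a different technique from Text::Unidecode, not a reimplementation of it: it decomposes each character with Unicode::Normalize and then drops the combining marks. Characters with no decomposition (ø or ß, for example) pass through unchanged, so it is a much narrower tool.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# "Budějovický Budvar" as a character string, written with escapes
# so the source file encoding does not matter.
my $name = "Bud\x{11B}jovick\x{FD} Budvar";

# Decompose (e.g. U+011B becomes "e" plus a combining caron),
# then remove the nonspacing marks.
(my $plain = NFD($name)) =~ s/\p{Mn}//g;

print "$plain\n";
```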

asked Dec 31 '10 19:12

RedGrittyBrick

1 Answer

Upgrade to a newer libwww-perl. The old version you are using only honors the charset argument to decoded_content for text/* content types; newer versions also honor it for application/xml and for any content type ending in +xml.
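To see which versions are currently installed before upgrading, something like the following can be used. It is a sketch: module_version is a hypothetical helper, not part of any library, and HTTP::Message is included on the assumption that a newer installation ships it as a separate distribution. Which release changed the behaviour is not stated here, so the output is only a starting point for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: return a module's installed version,
# or undef if the name is invalid or the module cannot be loaded.
sub module_version {
    my ($module) = @_;
    return unless $module =~ /\A\w+(?:::\w+)*\z/;   # plain module names only
    return unless eval "require $module; 1";
    return $module->VERSION;
}

for my $module (qw(LWP HTTP::Message)) {
    my $version = module_version($module);
    printf "%-14s %s\n", $module,
        defined $version ? $version : "not installed";
}
```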

answered Nov 15 '22 08:11

ysth

Tags: unicode, utf-8, perl