
Wide character error using utf8 pragma with HTML::Laundry

I am having trouble with the HTML::Laundry module. The following snippet demonstrates what happens with and without use utf8. Enabling use utf8 results in an error:

Wide character in subroutine entry at /usr/local/share/perl/5.14.2/HTML/Laundry.pm line 329

Without use utf8 the result is correct, but in the context of my program I need the utf8 pragma.

use utf8;
use HTML::Laundry;
use strict;

my $snippet = "<p style=\"line-height: 18px; font-family: Verdana, Arial, Helvetica, sans-serif; color: rgb(153, 153, 153); margin: 0px; padding: 0px;\"><br>Sämtliche Produkte von collec entstehen in Zusammenarbeit mit Schweizer Werkstätten. collec setzt sich dafür ein, dass auch Menschen, die an geschützten Arbeitsplätzen tätig sind, hochwertige Produkte herstellen können. collec macht sich stark für die Erhaltung von Handarbeit und Handwerk, denn „Handwerk berührt das Denken.“</p>";

my $clean = HTML::Laundry->new();
$clean->remove_acceptable_element(['font','span']);
$clean->remove_acceptable_attribute(['class','style']);
print $clean->clean($snippet);

The program file itself is encoded as UTF-8:

file -i cleantest.pl 
cleantest.pl: text/plain; charset=utf-8
asked Aug 06 '14 08:08 by xherbie



1 Answer

Peeking at the source, it looks like HTML::Laundry is initializing HTML::Parser with the utf8_mode flag set. This flag causes HTML::Parser to expect its input to be given as an undecoded UTF-8 byte stream, rather than as a Unicode character stream.
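To make the distinction concrete, here is a minimal sketch using only core Perl (no HTML::Laundry involved). It assumes the source file is saved as UTF-8, and the variable names are just for illustration. Under use utf8 the literal becomes a character string containing code points above 255, which is what Perl calls "wide" when something expects bytes. After encoding, every element fits in a byte:

```perl
use strict;
use warnings;
use utf8;                          # string literals are decoded into characters
use Encode qw(encode_utf8);

my $chars = "„Handwerk“";          # character string (illustrative sample text)
my $bytes = encode_utf8($chars);   # the same text as UTF-8 encoded bytes

# A string is "wide" here if any element is above 255.
my $chars_wide = $chars =~ /[^\x00-\xFF]/ ? 1 : 0;
my $bytes_wide = $bytes =~ /[^\x00-\xFF]/ ? 1 : 0;

printf "chars: length=%d wide=%d\n", length($chars), $chars_wide;
printf "bytes: length=%d wide=%d\n", length($bytes), $bytes_wide;
# prints:
# chars: length=10 wide=1
# bytes: length=14 wide=0
```

The two curly quotes take three bytes each in UTF-8, which accounts for the difference in length.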

You might want to file a bug report / feature request on HTML::Laundry about this, asking for some way to make it properly handle Unicode input. In the meantime, though, there's an obvious workaround: just encode the input as UTF-8 before passing it to HTML::Laundry:

use Encode qw(encode_utf8);

print $clean->clean(encode_utf8 $snippet);

or:

utf8::encode($snippet);    # encode to UTF-8 in place
print $clean->clean($snippet);
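If the rest of your program works with character strings, you may also want to decode the result again. The following is only a sketch: clean_characters is a hypothetical helper, and $fake_cleaner is a stand-in for something like sub { $clean->clean($_[0]) } so that the example runs without HTML::Laundry installed. It assumes the cleaner returns UTF-8 encoded bytes, which I have not verified for HTML::Laundry:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: wraps a byte-oriented cleaner (a code ref) so that
# the caller passes in and gets back character strings.
sub clean_characters {
    my ($cleaner, $text) = @_;
    my $bytes_in  = encode_utf8($text);      # characters -> UTF-8 bytes
    my $bytes_out = $cleaner->($bytes_in);   # the cleaner only ever sees bytes
    return decode_utf8($bytes_out);          # assumption: UTF-8 bytes come back
}

# Stand-in cleaner for illustration only: strips style attributes.
my $fake_cleaner = sub {
    my ($html) = @_;
    $html =~ s/\s+style="[^"]*"//g;
    return $html;
};

my $result = clean_characters(
    $fake_cleaner,
    '<p style="margin: 0px;">Werkstätten „Handwerk“</p>',
);

binmode(STDOUT, ':encoding(UTF-8)');         # encode characters when printing
print $result, "\n";
```

Setting the output layer with binmode keeps the final print from emitting its own "Wide character" warning now that $result holds characters again.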
answered Sep 29 '22 13:09 by Ilmari Karonen
