
NBSP malformed while using Mojo::DOM

I am using the Mojo::DOM Perl module to replace <IMG> tags, but Mojo::DOM replaces the &nbsp; entity with \xa0, and when I print the result to the page the NBSP character becomes \x{fffd} and shows up as a question mark. I have tried replacing \x{00a0} with &nbsp;, but doing that corrupts another Unicode character. Here's my code:

#!/usr/bin/perl

use utf8;
use strict;
use warnings;
use CGI;

my $cgi = new CGI;

print $cgi->header(-charset => 'utf-8');

my %params = $cgi->Vars;

print q[<html><head><title>UTF-8 Test</title></head><body><form method="POST"><textarea name="msg" cols="50" rows="20">].$params{msg}.q[</textarea><br/><br/><input type="submit"></form>];

if($ENV{REQUEST_METHOD} eq 'POST') {
    require Mojo::DOM;


    my $dom = Mojo::DOM->new($params{msg});

    for my $e ($dom->find('img')->each) {
          my $x = $e->attr('data-char');

          if(defined($x) && $x) {
             $e->replace($x);
          }
          else {
              $e->delete;
          }
    }

    $params{msg} = $dom->to_string();
    print '<hr/><div>'.$params{msg}.'</div>';
}

print q[</body></html>];

Contents of msg param that is POSTed:

אֱלֹהִים,+אֵת+הַשָּׁמַיִם,+וְאֵת+הָאָרֶץ. 1 In the beginningpo &nbsp;<img src="p.jpg" data-char="😎"> Easy Bengali Typing: বাংলা টাইপ করুন Минюст РФ опубликовал список СМИ-иноагентов Japanese Keyboard - 日本語のキーボード Pre-Qin and Han (先秦兩漢)

Here's a screenshot of the output:

[Screenshot of the rendered output]

Asked by Pradeep, Dec 06 '17 08:12


1 Answer

Mojo::DOM expects to work with characters, not UTF-8 encoded bytes, so it decodes &nbsp; to the character U+00A0, which then needs to be encoded to UTF-8 before output. The old CGI module does not decode your input parameters or encode your output the way a modern framework would, so you need to handle this yourself: decode $params{msg} from UTF-8 before passing it to Mojo::DOM, and encode the result back to UTF-8 before putting it in the output (you are declaring an output charset of UTF-8, after all).

if($ENV{REQUEST_METHOD} eq 'POST') {
    require Mojo::DOM;
    require Mojo::Util;

    # Decode the raw UTF-8 bytes from CGI into characters before parsing
    my $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $params{msg}));

    for my $e ($dom->find('img')->each) {
          my $x = $e->attr('data-char');

          if(defined($x) && $x) {
             $e->replace($x);
          }
          else {
              $e->delete;
          }
    }

    # Encode the characters back to UTF-8 bytes to match the declared charset
    $params{msg} = Mojo::Util::encode('UTF-8', $dom->to_string());
    print '<hr/><div>'.$params{msg}.'</div>';
}
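
To see why the decode/encode pair matters, here is a minimal sketch that compares the bytes produced with and without it. It uses only the core Encode module so it runs without Mojolicious. The input string and the helpers unescape_nbsp and hex_codes are hypothetical, and the substitution is only a crude stand-in for the entity decoding an HTML parser performs.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a raw CGI parameter: the UTF-8 bytes of "café"
# followed by a literal &nbsp; entity (illustrative input, not from the question).
my $bytes = "caf\xC3\xA9&nbsp;";

# Crude stand-in for a parser's entity decoding: &nbsp; becomes the character U+00A0.
sub unescape_nbsp {
    my ($text) = @_;
    $text =~ s/&nbsp;/\x{00A0}/g;
    return $text;
}

# Show each element of a string as two hex digits.
sub hex_codes {
    my ($text) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $text);
}

# Broken path: the entity is decoded inside a byte string, so a lone A0 byte
# lands in the output. A0 on its own is not valid UTF-8.
my $broken = unescape_nbsp($bytes);

# Fixed path: decode to characters first, then encode back to UTF-8 at the end.
my $fixed = encode('UTF-8', unescape_nbsp(decode('UTF-8', $bytes)));

print "broken: ", hex_codes($broken), "\n";   # broken: 63 61 66 C3 A9 A0
print "fixed:  ", hex_codes($fixed),  "\n";   # fixed:  63 61 66 C3 A9 C2 A0
```

In the fixed path the NBSP is written as the two-byte sequence C2 A0, which a browser reading the page as UTF-8 can interpret. The lone A0 from the broken path is what gets rendered as the replacement character.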

Answered by Grinnz, Nov 15 '22 06:11