
How do I get the length of a Perl Unicode string input via Ajax or CGI?

Okay, this should be really simple, but I have searched all over for the answer and also read the following thread: How do I find the length of a Unicode string in Perl?

It does not help me. I know how to get Perl to treat a string constant as UTF-8 and return the right number of chars (instead of bytes) but somehow it doesn't work when Perl receives the string via my AJAX call.

Below, I am posting the three Greek letters alpha, beta and omega as Unicode. Perl tells me the length is 6 (bytes) when it should tell me only 3 (chars). How do I get the correct char count?

#!/usr/bin/perl
use strict;

if ($ENV{CONTENT_LENGTH}) {
    binmode (STDIN, ":utf8");
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
    print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
    exit;
}

print "Content-Type: text/html; charset=UTF-8\n\n";
print qq[<html><head><script>
        var oRequest;
        function MakeRequest () {
            oRequest = new XMLHttpRequest();
            oRequest.onreadystatechange = zxResponse;
            oRequest.open ('POST', '/test/unicode.cgi', true);
            oRequest.send (encodeURIComponent (document.oForm.oInput.value));
        }
        function zxResponse () {
            if (oRequest.readyState==4 && oRequest.status==200) {
                alert (oRequest.responseText);
            }
        }
    </script></head><body>
        <form name="oForm" method="POST">
            <input type="text" name="oInput" value="&#x03B1;&#x03B2;&#x03A9;">
            <input type="button" value="Ajax Submit" onClick="MakeRequest();">
        </form>
    </body></html>
];

By the way, the code is intentionally simplified (I know how to make a cross-browser AJAX call, etc.) and using the CGI Perl module is not an option.

W3Coder asked Dec 10 '22 13:12


2 Answers

You need to decode the string before calling length. For example:

use Encode;

my $utf_string = decode_utf8($_); ## decode the UTF-8 octets into characters
print length($utf_string);
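
Applied to the script in the question, the decode has to happen after the percent-escapes are unpacked, because the request body sent by encodeURIComponent is plain ASCII until then. A minimal sketch, assuming a hypothetical helper named `decode_body` and a hard-coded body standing in for the data read from STDIN:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper (not part of the original script): turn a
# percent-encoded request body into a string of characters.
sub decode_body {
    my ($body) = @_;
    # Undo the percent-encoding first; the result is a string of raw octets.
    (my $octets = $body) =~ s{%([a-fA-F0-9]{2})}{ chr(hex($1)) }eg;
    # Then decode those octets as UTF-8 to get logical characters.
    return decode_utf8($octets);
}

# Assumed input: what encodeURIComponent produces for alpha, beta, omega.
my $body = '%CE%B1%CE%B2%CE%A9';
my $text = decode_body($body);
printf "%d chars\n", length $text;
```

The order matters: decoding before the substitution would have no effect, since there are no multi-byte sequences in the body until the escapes are expanded.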

From the Encode manual:

$string = decode_utf8($octets [, CHECK]);

equivalent to $string = decode("utf8", $octets [, CHECK]) . The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so it is possible for this call to fail. For CHECK, see Handling Malformed Data.
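
Since the call can fail on malformed input, it may be worth checking for that when the octets come from a client. A rough sketch, assuming a hypothetical helper named `try_decode`; it uses `Encode::FB_CROAK` as the CHECK value so that `decode` dies on invalid data, and traps that with `eval`:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# octets are not valid UTF-8.
sub try_decode {
    my ($octets) = @_;
    # Work on a copy, because a non-zero CHECK may modify its argument.
    my $copy = $octets;
    # FB_CROAK makes decode die on malformed input; eval catches the die.
    my $string = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $string;
}

print defined try_decode("\xCE\xB1") ? "valid\n" : "malformed\n";
print defined try_decode("\xCE")     ? "valid\n" : "malformed\n";
```

The second call passes a lone lead byte, which is a truncated sequence and should be reported as malformed.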

Ivan Nevostruev answered Dec 14 '22 23:12


For a "native" way to accomplish this, you can convert as you copy with this method:

Open an in-memory filehandle on the string with the desired encoding layer and read from that. The conversion then happens as the characters are read.

use strict;
use warnings;

my $utf_str = "αβΩ"; # alpha, beta, omega

print "$utf_str is ", length $utf_str, " characters\n";

use open ':encoding(utf8)';
open my $fh, '<', \$utf_str;

my $new_str;

{ local $/; $new_str=<$fh>; }

binmode(STDOUT, ":utf8");
print "$new_str ", length $new_str, " characters"; 

#output:
αβΩ is 6 characters
αβΩ 3 characters

If you want to convert the encoding in place, you can use this:

my $utf_str = "αβΩ";
print "$utf_str is ", length $utf_str, " characters\n";
binmode(STDOUT, ":utf8");
utf8::decode($utf_str);
print "$utf_str is ", length $utf_str, " characters\n";

#output:
αβΩ is 6 characters
αβΩ is 3 characters
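
Note that `utf8::decode` returns false when the string is not well-formed UTF-8, so the result can be checked. A small sketch, assuming a hypothetical helper named `char_length` that works on a copy so the caller's octets stay untouched:

```perl
use strict;
use warnings;

# Hypothetical helper: character length of a UTF-8 octet string,
# or undef if the octets are not valid UTF-8.
sub char_length {
    my ($octets) = @_;
    my $copy = $octets;               # utf8::decode works in place, so decode a copy
    return unless utf8::decode($copy);
    return length $copy;
}

# The six octets that encode alpha, beta, omega.
my $len = char_length("\xCE\xB1\xCE\xB2\xCE\xA9");
print defined $len ? "$len characters\n" : "not valid UTF-8\n";
```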

You should not shy away from Encode, however.

dawg answered Dec 14 '22 23:12