Okay, this should be really simple, but I have searched all over for the answer and also read the following thread: How do I find the length of a Unicode string in Perl?
It does not help me. I know how to get Perl to treat a string constant as UTF-8 and return the right number of chars (instead of bytes) but somehow it doesn't work when Perl receives the string via my AJAX call.
Below, I am posting the three Greek letters alpha, beta and omega in Unicode. Perl tells me the length is 6 (bytes) when it should be 3 (characters). How do I get the correct character count?
#!/usr/bin/perl
use strict;
if ($ENV{CONTENT_LENGTH}) {
binmode (STDIN, ":utf8");
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
exit;
}
print "Content-Type: text/html; charset=UTF-8\n\n";
print qq[<html><head><script>
var oRequest;
function MakeRequest () {
oRequest = new XMLHttpRequest();
oRequest.onreadystatechange = zxResponse;
oRequest.open ('POST', '/test/unicode.cgi', true);
oRequest.send (encodeURIComponent (document.oForm.oInput.value));
}
function zxResponse () {
if (oRequest.readyState==4 && oRequest.status==200) {
alert (oRequest.responseText);
}
}
</script></head><body>
<form name="oForm" method="POST">
<input type="text" name="oInput" value="αβΩ">
<input type="button" value="Ajax Submit" onClick="MakeRequest();">
</form>
</body></html>
];
By the way, the code is intentionally simplified (I know how to make a cross-browser AJAX call, etc.) and using the CGI Perl module is not an option.
You need to decode this string before calling length. For example:
use Encode;
my $utf_string = decode_utf8($_); # decode the UTF-8 octets into a character string
print length($utf_string);
From the Encode manual:
$string = decode_utf8($octets [, CHECK]);
Equivalent to $string = decode("utf8", $octets [, CHECK]). The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so it is possible for this call to fail. For CHECK, see Handling Malformed Data.
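Applied to the script in the question, the order of operations matters: undo the percent-encoding first, then decode the resulting octets. The following is only a sketch. The helper name body_to_chars is hypothetical, and it assumes the POST body is the percent-encoded UTF-8 that encodeURIComponent produces.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a percent-encoded request body into a character string.
sub body_to_chars {
    my ($raw) = @_;
    # Step 1: undo the percent-encoding; the result is raw UTF-8 octets.
    (my $octets = $raw) =~ s{%([a-fA-F0-9]{2})}{ chr(hex($1)) }eg;
    # Step 2: decode the octets into characters; FB_CROAK dies on malformed input.
    return decode('UTF-8', $octets, Encode::FB_CROAK);
}

# '%CE%B1%CE%B2%CE%A9' is what encodeURIComponent sends for the three Greek letters.
my $chars = body_to_chars('%CE%B1%CE%B2%CE%A9');
print length($chars), " chars\n";   # prints "3 chars"
```

The :utf8 layer on STDIN does not help here, because the body is still percent-encoded ASCII when it is read. If the decoded string is printed back, STDOUT needs an encoding layer, for example binmode(STDOUT, ':encoding(UTF-8)').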
For a "native" way to accomplish this, you can convert as you copy: open an in-memory file with the desired encoding layer and read from it. The conversion happens as the characters are read.
use strict;
use warnings;
my $utf_str = "αβΩ"; # alpha, beta, omega
print "$utf_str is ", length $utf_str, " characters\n";
use open ':encoding(utf8)';
open my $fh, '<', \$utf_str;
my $new_str;
{ local $/; $new_str=<$fh>; }
binmode(STDOUT, ":utf8");
print "$new_str ", length $new_str, " characters";
#output:
αβΩ is 6 characters
αβΩ 3 characters
If you want to convert the encoding in place, you can use this:
my $utf_str = "αβΩ";
print "$utf_str is ", length $utf_str, " characters\n";
binmode(STDOUT, ":utf8");
utf8::decode($utf_str);
print "$utf_str is ", length $utf_str, " characters\n";
#output:
αβΩ is 6 characters
αβΩ is 3 characters
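Note that utf8::decode returns false when the string is not well-formed UTF-8, so the result is worth checking. A minimal sketch, where the helper name char_count is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: character count of a UTF-8 octet string,
# or undef if the octets are not well-formed UTF-8.
sub char_count {
    my ($octets) = @_;    # works on a copy, so the caller's string is untouched
    return utf8::decode($octets) ? length($octets) : undef;
}

print char_count("\xCE\xB1\xCE\xB2\xCE\xA9"), "\n";      # 3 (octets for the three Greek letters)
print char_count("\xFF\xFE") // 'invalid UTF-8', "\n";   # invalid UTF-8
```

The // operator used in the last line requires Perl 5.10 or later.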
You should not shy away from Encode, however.