What is the best way to find out whether a scalar value is ASCII/UTF-8 (text) or binary data in Perl? Is this code right?
if (is_utf8($scalar, 1) or ($scalar =~ m/\A [[:ascii:]]* \Z/xms)) {
    # $scalar is text
}
else {
    # $scalar is binary
}
Is there a better way?
is_utf8 tests whether Perl's internal UTF8 flag is turned on. A scalar can contain correctly formed UTF-8 and still not have the flag turned on. I think it's also possible to deliberately turn the flag on for a scalar holding malformed UTF-8, but I'm not sure.
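To illustrate the point about the flag, here is a minimal sketch using the Encode module. The variable names are only for illustration. The same character is held once as encoded bytes (flag off) and once as a decoded character string (flag on):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8 is_utf8);

# Well-formed UTF-8 bytes for U+263A; the internal flag is off
# because this is a byte string, not a character string.
my $bytes = encode_utf8("\x{263A}");

# The same data decoded into a Perl character string; the flag is on.
my $chars = decode_utf8($bytes);

printf "bytes: length=%d flag=%s\n", length($bytes), is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d flag=%s\n", length($chars), is_utf8($chars) ? 'on' : 'off';
```

So the flag tells you how Perl is storing the string, not whether the content is valid UTF-8 text.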
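To check whether the scalar contains UTF-8 data, you need to check the flag, and if it is not set, also try something like
use Encode qw(decode_utf8);
eval {
    # FB_CROAK makes decode_utf8 die on malformed input; LEAVE_SRC keeps $scalar unmodified
    my $utf8 = decode_utf8($scalar, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
and then check for errors in $@. Without a CHECK argument such as FB_CROAK, decode_utf8 does not die on malformed input; it silently substitutes replacement characters.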
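Putting the two checks together, a rough sketch might look like the following. The helper name looks_like_utf8_text is hypothetical (it is not part of Encode). It assumes you want strict UTF-8, so it uses Encode::decode('UTF-8', ...) with FB_CROAK on a copy of the input, because decoding with a CHECK value can modify its argument:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $scalar is already a character string
# (flag on) or is a byte string that decodes cleanly as strict UTF-8.
sub looks_like_utf8_text {
    my ($scalar) = @_;
    return 1 if Encode::is_utf8($scalar);

    my $copy = $scalar;    # decode with a CHECK value may alter its input
    my $ok = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK);
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("plain ascii", "caf\xC3\xA9", "\xFF\xFE\x00\x01") {
    printf "%-4s %s\n",
        looks_like_utf8_text($sample) ? 'text' : 'bin',
        join ' ', map { sprintf '%02X', ord } split //, $sample;
}
```

Keep in mind that this is still a guess: short binary blobs can happen to be valid UTF-8.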
To check whether a non-UTF-8 scalar contains only ASCII data, your idea $scalar =~ m/\A [[:ascii:]]* \Z/xms looks OK.
The best way, clearly, is to keep track of what you are reading at the point where you read it. You as the programmer should already know whether you are getting text (and in which encoding) or binary data. When you're reading text, you Encode::decode() it (see http://p3rl.org/UNI for details) into Perl text strings.
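As a small sketch of deciding at the input boundary, the example below writes a temporary file and reads it back twice: once through an :encoding(UTF-8) layer (text, decoded into characters) and once with :raw (binary, left as bytes). The file contents and variable names are made up for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file holding the UTF-8 bytes for "café\n".
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "caf\xC3\xA9\n";
close $out or die "close failed: $!";

# Text: decode while reading, so the program only ever sees characters.
open my $text_in, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $line = <$text_in>;
close $text_in;
chomp $line;

# Binary: read the same file as raw bytes, with no decoding at all.
open my $bin_in, '<:raw', $path or die "open failed: $!";
my $bytes = do { local $/; <$bin_in> };
close $bin_in;

printf "text:   %d characters\n", length $line;
printf "binary: %d bytes\n",      length $bytes;
```

With this approach the question "is this scalar text or binary?" never comes up, because each scalar's nature is fixed by how it was read.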
If you really don't know beforehand, the -T and -B file tests offer a heuristic.
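For what it's worth, the file tests work on a filename or filehandle, not on an arbitrary scalar. A minimal sketch, using two temporary files whose contents are invented for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Helper for this example only: write raw bytes to a temp file, return its path.
sub write_temp {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh, ':raw';
    print {$fh} $data;
    close $fh or die "close failed: $!";
    return $path;
}

my $text_path   = write_temp("hello world\n" x 10);
my $binary_path = write_temp("\x00\x01\x02\xFF" x 100);

for my $path ($text_path, $binary_path) {
    # -T and -B inspect the first block of the file and guess.
    my $guess = -T $path ? 'text' : -B $path ? 'binary' : 'unknown';
    print "$path: $guess\n";
}
```

Since it is only a heuristic, treat the result as a hint. An empty file, for example, passes both tests.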
Disregard Kinopiko's answer. In the vast majority of cases, you should not need to know about the internal representation of data, and messing with the utility functions from the utf8 pragma module is the wrong approach.