
How can I guess if a string has text or binary data in Perl?

Tags: perl

What is the best way to find out whether a scalar value holds ASCII/UTF-8 text or binary data in Perl? Is this code right?

if (is_utf8($scalar, 1) or ($scalar =~ m/\A [[:ascii:]]* \Z/xms)) {
     # $scalar is text
}
else {
     # $scalar is binary
}

Is there a better way?

asked by Jozef, Dec 02 '25 03:12


2 Answers

is_utf8 tests whether Perl's internal UTF8 flag is turned on for the scalar. A scalar can contain well-formed UTF-8 bytes without having the flag turned on. I think it's also possible to deliberately turn the flag on for a scalar holding malformed UTF-8, but I'm not sure.

To check whether the scalar contains UTF-8 data, you need to check the flag and, if it is not set, also try something like

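use Encode qw(decode_utf8);
eval {
    # FB_CROAK makes decode_utf8 die on malformed input (the default silently
    # substitutes replacement characters); LEAVE_SRC keeps $scalar unmodified.
    my $utf8 = decode_utf8($scalar, Encode::FB_CROAK | Encode::LEAVE_SRC);
};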

and then check for errors in $@.

To check whether a non-UTF-8 scalar contains non-ASCII data, your idea $scalar =~ m/\A [[:ascii:]]* \Z/xms looks ok.
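The two checks above could be combined roughly as follows. This is only a sketch: classify_bytes is a hypothetical helper name, and it assumes the scalar holds raw bytes (for example, read from a handle in binmode), not an already decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: classifies a byte string as 'ascii', 'utf8', or 'binary'.
sub classify_bytes {
    my ($bytes) = @_;

    # Pure 7-bit data: every byte is in the ASCII range.
    return 'ascii' if $bytes =~ m/\A [[:ascii:]]* \z/xms;

    # Strict decode: FB_CROAK dies on malformed input,
    # LEAVE_SRC keeps $bytes intact.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 'utf8' : 'binary';
}

print classify_bytes("plain text"), "\n";          # ascii
print classify_bytes("caf\xC3\xA9"), "\n";         # utf8
print classify_bytes("\xFF\xFE\x00\x01"), "\n";    # binary
```

Note that [[:ascii:]] also matches control bytes such as NUL, so an 'ascii' result does not by itself mean the data is printable text.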

The best way, clearly, is to simply keep track when you are reading the data. You as the programmer should already know whether you are getting text (and its encoding) or binary data. When you're reading text, you Encode::decode() it (see http://p3rl.org/UNI for details) into Perl text strings.
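As a small illustration of decoding at the input boundary, here is a sketch that assumes the incoming bytes are already known to be UTF-8 encoded text. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input is known to be UTF-8 encoded text.
my $bytes = "na\xC3\xAFve";             # raw bytes, as read from a file or socket
my $text  = decode('UTF-8', $bytes);    # Perl text string (characters)

printf "%d bytes, %d characters\n", length($bytes), length($text);

# Encode again only when writing the text back out.
my $out = encode('UTF-8', $text);
```

With real files, the same effect can be had by opening the handle with an encoding layer such as '<:encoding(UTF-8)', so that reads return decoded text directly.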

If you really don't know beforehand, the -T and -B file tests offer a heuristic.
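A quick sketch of the file tests in use. write_temp is a hypothetical helper that only exists to create sample files; in practice you would apply -T or -B to the file you actually have. Keep in mind that these tests guess from the first block of the file, and that an empty file satisfies both.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes $data to a temporary file and returns its path.
sub write_temp {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $data;
    close $fh or die "close failed: $!";
    return $path;
}

my $text_file   = write_temp("Hello, world\n" x 10);
my $binary_file = write_temp("\x00\x01\x02\xFF" x 10);

print "text file looks like text\n"     if -T $text_file;
print "binary file looks like binary\n" if -B $binary_file;
```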

Disregard Kinopiko's answer. In the vast majority of cases you should not need to know about the internal representation of your data, and messing with the utility functions from the utf8 pragma module is the wrong approach.

answered by daxim, Dec 05 '25 16:12


