
How can I get Perl to detect bad UTF-8 sequences?

I'm running Perl 5.10.0 and Postgres 8.4.3, and inserting strings into a database that sits behind DBIx::Class.

These strings should be in UTF-8, and therefore my database is running in UTF-8. Unfortunately, some of these strings are bad and contain malformed UTF-8, so when I run it I get an exception:

DBI Exception: DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xb5

I thought that I could simply ignore the invalid ones and worry about the malformed UTF-8 later, so I expected this code to flag and ignore the bad titles:

if(not utf8::valid($title)){
   $title="Invalid UTF-8";
}
$data->title($title);
$data->update();

However, Perl seems to think that the strings are valid, and the update still throws the exception.

How can I get Perl to detect the bad UTF-8?

gorilla asked Apr 16 '10 22:04



2 Answers

How are you getting your strings? Are you sure that Perl thinks that they are UTF-8 already? If they aren't decoded yet (that is, octets interpreted as some encoding), you need to do that yourself:

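    use Encode qw(decode FB_CROAK);   # FB_CROAK must be imported explicitly

    # decode() croaks on malformed input and eval traps it. Testing for
    # definedness keeps valid strings such as "" or "0" from dying.
    my $ustring = eval { decode( 'utf8', $byte_string, FB_CROAK ) };
    die "Could not decode string: $@" unless defined $ustring;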

Better yet, if you know that your source of strings is already UTF-8, you need to read that source as UTF-8. Look at the code you have that gets the strings to see if you are doing that properly.
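As a rough illustration of reading a source as UTF-8, here is a minimal sketch that assumes the strings come from a file (the question doesn't say where they come from). The temporary file and its contents are made up for the example; the part that matters is the `:encoding(UTF-8)` layer on `open`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 file so the example is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                           # write raw octets
print {$out} "caf\xC3\xA9\n";           # "café" as UTF-8 bytes
close $out or die "close failed: $!";

# The :encoding(UTF-8) layer decodes octets into characters on read.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d characters\n", length $line;
```

Note that the I/O layer warns about malformed input rather than dying, so an explicit decode with FB_CROAK is still the way to go if you need to reject bad strings.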

brian d foy answered Sep 22 '22 12:09


First off, please follow the documentation - the utf8 module should only be used in the 'use utf8;' form to indicate that your source code is UTF-8 instead of Latin-1. Don't use any of the utf8 functions.

Perl makes the distinction between byte strings and decoded character strings (which it stores internally as UTF-8). In byte mode, Perl doesn't know or care what encoding you are using, and will treat the bytes as Latin-1 if you print them. Take for example the Euro sign (€). In UTF-8 this is 3 bytes: 0xE2, 0x82, 0xAC. If you ask for the length of these bytes, Perl will return 3. Again, it doesn't care about the encoding. The string can hold any bytes in any encoding, legal or illegal.

If you use the Encode module and call Encode::decode('UTF-8', $bytes), you will get a new string which has the so-called UTF8 flag set. Perl now knows your string is in UTF-8, and will return a length of 1.
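To make the two kinds of string concrete, here is a small sketch using the Euro sign bytes from above. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The Euro sign as three raw UTF-8 octets (a byte string).
my $bytes = "\xE2\x82\xAC";

# decode() turns the octets into a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;   # counts octets
printf "chars: length %d\n", length $chars;   # counts characters
printf "code point: U+%04X\n", ord $chars;
```

This should print lengths of 3 and 1, and the code point U+20AC.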

The problem is that utf8::valid only applies to the second type of string. Your strings are probably in the first form, byte mode, and utf8::valid just returns true for anything in byte form. This is documented in perldoc utf8.
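You can see this for yourself with a string containing the 0xb5 byte from the error message. The sample title here is invented; only the stray byte matters.

```perl
use strict;
use warnings;

# 0xB5 on its own is not valid UTF-8 (it is the byte from the Postgres error).
my $title = "caf" . "\xB5";

# The string is held as bytes (no UTF8 flag), so utf8::valid() has nothing
# to check and reports the string as consistent.
my $is_flagged = utf8::is_utf8($title) ? 1 : 0;
my $is_valid   = utf8::valid($title)   ? 1 : 0;

print "UTF8 flag:   $is_flagged\n";
print "utf8::valid: $is_valid\n";
```

The flag comes back as 0 and utf8::valid as 1, which is why the check in the question never fires.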

The solution is to get Perl to decode your byte strings as UTF-8, and detect any errors. This can be done with FB_CROAK as brian d foy explains:

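# Requires "use Encode;". FB_CROAK is not exported by default, so it is
# fully qualified here. Testing for definedness keeps valid strings such
# as "" or "0" from dying.
my $ustring = eval { decode( 'UTF-8', $byte_string, Encode::FB_CROAK ) };
die "Could not decode string: $@" unless defined $ustring;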

You can then catch that error and skip those invalid strings.
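A minimal sketch of catching the error and skipping bad strings might look like this. The helper name `try_decode_utf8` and the sample titles are made up for the example. LEAVE_SRC is added because decode() with a CHECK value otherwise modifies its input in place.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded string, or undef if the octets
# are not well-formed UTF-8. LEAVE_SRC keeps decode() from altering its input.
sub try_decode_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
}

my @titles = ("plain ascii", "caf\xC3\xA9", "bad \xB5 byte");
my @good;
for my $octets (@titles) {
    my $title = try_decode_utf8($octets);
    if (defined $title) {
        push @good, $title;
    }
    else {
        warn "Skipping malformed title\n";
    }
}
printf "%d of %d titles decoded\n", scalar @good, scalar @titles;
```

In the question's code you would call the helper on `$title` and substitute the placeholder text (or skip the update) when it returns undef.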

Or if you know your code is mostly UTF-8 with a few invalid sequences here and there, you can use:

my $ustring = decode( 'UTF-8', $byte_string );

which uses the default mode of FB_DEFAULT, replacing each invalid sequence with U+FFFD, the Unicode REPLACEMENT CHARACTER (usually rendered as a diamond with a question mark in it).
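For example, with one stray byte in otherwise valid input (the input string is invented), you can count the substitutions afterwards if you want to log them:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "abc" . "\xB5" . "def";     # one stray byte in valid ASCII
my $chars = decode('UTF-8', $bytes);    # no CHECK argument: FB_DEFAULT

# Count the substitution characters that decode() inserted.
my $replaced = () = $chars =~ /\x{FFFD}/g;
printf "length %d, %d replacement character(s)\n", length $chars, $replaced;
```

Here the stray byte becomes a single U+FFFD, so the result is 7 characters long with 1 replacement.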

You can then pass the string directly to your database driver in most cases. Some drivers may require you to re-encode the string back to byte form first:

my $byte_string = encode('UTF-8', $ustring);

There are also regexes online that you can use to check for valid UTF-8 sequences before calling decode (check other Stack Overflow answers). If you use those regexes, you don't need to do any encoding or decoding.
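One commonly cited pattern of this kind is sketched below. It is adapted from a regex that circulates widely online, so treat it as a starting point and not an authoritative validator. The sub name `is_valid_utf8` is made up for the example, and it must be given a byte string, not an already decoded one.

```perl
use strict;
use warnings;

# Matches only if the whole byte string is well-formed UTF-8.
my $valid_utf8 = qr/\A(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
  | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1 to 3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4 to 15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*\z/x;

# Hypothetical helper wrapping the pattern.
sub is_valid_utf8 {
    my ($octets) = @_;
    return $octets =~ $valid_utf8 ? 1 : 0;
}

for my $octets ("caf\xC3\xA9", "bad \xB5 byte") {
    printf "%s\n", is_valid_utf8($octets) ? "valid" : "invalid";
}
```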

Finally, please use UTF-8 rather than utf8 in your calls to decode. The latter is more lax and lets some invalid UTF-8 sequences through (such as sequences that encode code points outside the Unicode range).
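If you want to check the difference on your own installation, a small sketch like the following should do it. The byte sequence is chosen for illustration: F4 90 80 80 would encode U+110000, one past the last Unicode code point. As far as I know the lax utf8 name accepts it and the strict UTF-8 name rejects it, but run it against your Encode version before relying on that.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# F4 90 80 80 would encode U+110000, which is beyond the Unicode range.
my $octets = "\xF4\x90\x80\x80";

# eval yields undef when decode() croaks on the input.
my $lax    = eval { decode('utf8',  $octets, FB_CROAK | LEAVE_SRC) };
my $strict = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };

print "utf8:  ", (defined $lax    ? "accepted" : "rejected"), "\n";
print "UTF-8: ", (defined $strict ? "accepted" : "rejected"), "\n";
```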

rjh answered Sep 26 '22 12:09