Regexp to check if code contains non-UTF-8 characters?

I am using PMD, Checkstyle, FindBugs, etc. in Sonar. I would like a rule that verifies that Java code contains only valid UTF-8 characters.

For example, the character � should not be allowed.

I could not find a rule for this in the above plugins, but I guess a custom rule can be made in Sonar.

user1340582 asked Oct 29 '12 06:10


1 Answer

Here is a regular expression that matches only strings made up entirely of valid UTF-8 byte sequences:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/

I derived it from RFC 3629, "UTF-8, a transformation format of ISO 10646", section 4, "Syntax of UTF-8 Byte Sequences".
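Since the question is about Java, here is a minimal sketch of how the same byte ranges might be applied there. It assumes the bytes are first decoded as ISO-8859-1 so that every byte becomes exactly one char with the same value, and it matches one sequence at a time in a loop rather than using a single starred group. The class and method names (Utf8RegexCheck, isValidUtf8) are made up for illustration and are not part of Sonar or any plugin.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sonar, PMD, Checkstyle or FindBugs.
final class Utf8RegexCheck {
    // One well-formed UTF-8 sequence, using the same byte ranges as the pattern above.
    private static final Pattern ONE_SEQUENCE = Pattern.compile(
          "[\\x00-\\x7F]"
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"
        + "|[\\xE1-\\xEC][\\x80-\\xBF]{2}"
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"
        + "|[\\xEE-\\xEF][\\x80-\\xBF]{2}"
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}");

    static boolean isValidUtf8(byte[] bytes) {
        // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same value,
        // so the byte-level ranges can be matched against a Java String.
        String oneCharPerByte = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = ONE_SEQUENCE.matcher(oneCharPerByte);
        int pos = 0;
        while (pos < oneCharPerByte.length()) {
            m.region(pos, oneCharPerByte.length());
            if (!m.lookingAt()) {
                return false; // no well-formed sequence starts at this byte
            }
            pos = m.end(); // continue right after the matched sequence
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] replacementChar = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD}; // U+FFFD encoded as UTF-8
        byte[] strayByte = {(byte) 0xFF};                                 // 0xFF never occurs in UTF-8
        System.out.println(isValidUtf8(replacementChar)); // true
        System.out.println(isValidUtf8(strayByte));       // false
    }
}
```

To check a source file, the byte array could come from java.nio.file.Files.readAllBytes.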

Factoring out the shared trailing continuation bytes gives the slightly shorter:

/^([\x00-\x7F]|([\xC2-\xDF]|\xE0[\xA0-\xBF]|\xED[\x80-\x9F]|([\xE1-\xEC]|[\xEE-\xEF]|\xF0[\x90-\xBF]|\xF4[\x80-\x8F]|[\xF1-\xF3][\x80-\xBF])[\x80-\xBF])[\x80-\xBF])*$/

This simple Perl script demonstrates usage:

#!/usr/bin/perl
use strict;
use warnings;

# Matches only strings made up entirely of well-formed UTF-8 byte sequences (RFC 3629, section 4).
my $valid_utf8 = qr/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/;

# \xEF\xBF\xBD is the three-byte UTF-8 encoding of U+FFFD.
my $passstring = "This string \xEF\xBF\xBD == � is valid UTF-8";
# \x{FFFD} is a single wide character rather than a byte sequence, so no alternative in the pattern matches it.
my $failstring = "This string \x{FFFD} == � is not valid UTF-8";

print 'Passstring ', ($passstring =~ $valid_utf8 ? 'passed' : 'did not pass'), "\n";
print 'Failstring ', ($failstring =~ $valid_utf8 ? 'passed' : 'did not pass'), "\n";

It produces the following output:

Passstring passed
Failstring did not pass
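
As an alternative to a regular expression (a different technique, named here only for comparison), Java's standard library can be asked to decode the bytes strictly and report malformed input instead of silently replacing it. The following is a minimal sketch under that assumption; the class and method names (Utf8DecoderCheck, isValidUtf8) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Sonar, PMD, Checkstyle or FindBugs.
final class Utf8DecoderCheck {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or truncated byte sequence
        }
    }

    public static void main(String[] args) {
        byte[] replacementChar = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD}; // U+FFFD encoded as UTF-8
        byte[] strayByte = {(byte) 0xFF};                                 // 0xFF never occurs in UTF-8
        System.out.println(isValidUtf8(replacementChar)); // true
        System.out.println(isValidUtf8(strayByte));       // false
    }
}
```

Note that, as the passing string in the Perl script shows, a � that is already stored as the bytes EF BF BD is itself valid UTF-8, so a validity check of this kind will not flag it.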

kshepherd answered Oct 11 '22 13:10