
regex and index don't match unicode chars

One of the functions in a library I'm writing returns a string which is problematic when trying to locate unicode characters via either a regex or the index function. The string prints normally (using Sublime text's console for unicode printing) like this:

<xml>V日한ế</xml>

And I'm trying to match it like this: $string =~ m/V日한ế/. I have the utf8 pragma enabled (use utf8;).

I apologize that I am unable to reproduce a minimal breaking example, because when I construct the string myself and try to match it, everything works just fine. I tried using the hexdump function from this site, but it prints the same hex sequences for the unicode characters in the string returned by the library and in the string that I construct ($string2 = 'V日한ế'): 56 e6 97 a5 ed 95 9c e1 ba bf. The string from the library has the UTF-8 flag turned off and the constructed one has it turned on, but another test showed me that that wasn't the problem.

I only have one clue as to the source of the problem: the output with use re 'debug';. It gives the following message:

Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"...

It is printing the character "日" in the regex as %x{65e5} and the same character in the problematic string as %x{e6}%x{97}. The other unicode characters are similarly printed differently.

Can anyone with experience debugging strings and encodings tell me why regex and index can't find the unicode characters I know to be present in my string, and how I can make these functions find them?

Nate Glenn asked Oct 21 '22 02:10


1 Answer

Let's make a reproducible test case:

  1. Generating an input file:

    $ perl -E'say "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>"' >test.xml
    $ cat test.xml
    <xml>V日한ế</xml>
    

    This writes some bytes to a file. Note that my terminal emulator uses UTF-8.

  2. Trying to naively match the input:

    $ cat test.pl
    use strict; use warnings; use utf8; use autodie; use feature 'say';
    open my $fh, "<", shift @ARGV;
    
    my $s = <$fh>;
    say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
    say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
    $ perl test.pl test.xml
    <xml>V日한ế</xml>
     doesn't match
    string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{e6}\x{97}\x{a5}\x{ed}\x{95}\x{9c}\x{e1}\x{ba}\x{bf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}
    

    Oh, so the string from the file is seen as a string of bytes, not properly decoded codepoints. Who would have guessed?

  3. Let's add the :utf8 PerlIO-layer:

    $ cat test-utf8.pl
    use strict; use warnings; use utf8; use autodie; use feature 'say';
    open my $fh, "<:utf8", shift @ARGV;
    
    my $s = <$fh>;
    say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
    say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
    $ perl test-utf8.pl test.xml
    Wide character in say at test-utf8.pl line 5, <$_[...]> line 1.
    <xml>V日한ế</xml>
     matches
    string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{65e5}\x{d55c}\x{1ebf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}
    

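    Now it matches, because we have read the correctly decoded codepoints from the file. The "Wide character" warning appears because STDOUT has no encoding layer, so perl still prints the decoded string as UTF-8 but warns about it.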
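
If the bytes come from a library call rather than from a filehandle you open yourself, the same decoding step can be applied to the string directly. The following is only a sketch, assuming the library hands back UTF-8 encoded bytes; $bytes is a hypothetical stand-in for that return value, and the pattern is written with escapes so the example does not depend on how the source file is saved:

```perl
use strict; use warnings; use feature 'say';
use Encode qw(decode);

# Encode output as UTF-8 so printing decoded text would not warn "Wide character".
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the undecoded string a library returns
# (assumption: the bytes are UTF-8).
my $bytes = "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>";

# The text we are looking for, as codepoints: V日한ế
my $pattern = "V\x{65e5}\x{d55c}\x{1ebf}";

# Turn bytes into codepoints. FB_CROAK dies on malformed input instead of
# silently substituting; LEAVE_SRC keeps $bytes intact for the comparison below.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

say "bytes: ", ($bytes =~ m/\Q$pattern/ ? "matches" : "doesn't match");
say "chars: ", ($chars =~ m/\Q$pattern/ ? "matches" : "doesn't match");
say "index: ", index($chars, $pattern);
```

Decoding once, right where the data enters the program, tends to be easier to reason about than decoding at each point of use; whether the library's output really is UTF-8 is something you would need to confirm.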

Do you get the same output? If not, which perl/OS combination are you using? (This is perl 5.18.1 on Ubuntu GNU/Linux.)

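One issue remains with this code: there are multiple ways to represent ế. It can be the single precomposed codepoint U+1EBF, or a base letter followed by combining marks, and the two forms do not compare equal. You should therefore normalize both the string in the regex and your input: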

use Unicode::Normalize 'NFC';
my $regex_body = NFC "V日한ế";
my $s          = NFC scalar <$fh>;

... m/\Q$regex_body/ ...
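
To see why normalization matters, here is a small self-contained sketch that builds the same visible text in precomposed and decomposed form and compares them before and after NFC. The variable names are illustrative only:

```perl
use strict; use warnings; use feature 'say';
use Unicode::Normalize qw(NFC);

# Two spellings of the same visible text "Vế", written with escapes.
my $precomposed = "V\x{1ebf}";          # single codepoint U+1EBF
my $decomposed  = "Ve\x{302}\x{301}";   # e + combining circumflex + combining acute

# Without normalization the codepoint sequences differ, so the match fails.
say "raw:        ", ($decomposed =~ m/\Q$precomposed/ ? "matches" : "doesn't match");

# After NFC both sides use the precomposed form.
my $pattern = NFC($precomposed);
my $input   = NFC($decomposed);
say "normalized: ", ($input =~ m/\Q$pattern/ ? "matches" : "doesn't match");
```

The first line should report no match and the second a match. Any normalization form works as long as the pattern and the input use the same one.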

amon answered Oct 27 '22 21:10