regex and index don't match unicode chars

Question

One of the functions in a library I'm writing returns a string in which neither a regex nor the index function can find Unicode characters. The string prints normally (using Sublime Text's console for Unicode printing), like this:

<xml>V日한ế</xml>

I'm trying to match it like this: $string =~ m/V日한ế/. The script has use utf8 enabled.

I apologize that I can't provide a minimal failing example: when I construct the string myself and try to match it, everything works just fine. I tried using the hexdump function from this site, but it prints the same hex sequence for the string returned by the library and for the string that I construct ($string2 = 'V日한ế'): 56 e6 97 a5 ed 95 9c e1 ba bf. The string from the library has the UTF8 flag turned off and the constructed one has it turned on, but another test showed me that that wasn't the problem.

I only have one clue as to the source of the problem: the output with use re 'debug';. It gives the following message:

Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"...

It prints the character "日" in the regex as the single codepoint %x{65e5}, but the same character in the problematic string as the three bytes %x{e6}%x{97}%x{a5}. The other Unicode characters are likewise printed differently.

Can anyone with experience debugging strings and encodings tell me why regex and index can't find the unicode characters I know to be present in my string, and how I can make these functions find them?

amon · Accepted Answer

Let's make a reproducible test case:

Generating an input file:

$ perl -E'say "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>"' >test.xml
$ cat test.xml
<xml>V日한ế</xml>

This writes some bytes to a file. Note that my terminal emulator uses UTF-8.

Trying to naively match the input:

$ cat test.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<", shift @ARGV;

my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test.pl test.xml
<xml>V日한ế</xml>
 doesn't match
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{e6}\x{97}\x{a5}\x{ed}\x{95}\x{9c}\x{e1}\x{ba}\x{bf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

Oh, so the string from the file is seen as a string of bytes, not properly decoded codepoints. Who would have guessed?
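
To make the difference between bytes and codepoints concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative, and the pattern is written with \x{...} escapes so that the example does not depend on how the script file itself is saved:

```perl
use strict; use warnings;
use Encode qw(decode);

# The same ten bytes that ended up in test.xml (without the tags).
my $bytes = "V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf";

# Decoding turns the UTF-8 byte sequence into four codepoints.
my $chars = decode('UTF-8', $bytes);

# Under "use utf8" this pattern is the same as m/V日한ế/.
my $re = qr/V\x{65e5}\x{d55c}\x{1ebf}/;

printf "bytes: length %d, %s\n", length($bytes), ($bytes =~ $re ? "matches" : "doesn't match");
printf "chars: length %d, %s\n", length($chars), ($chars =~ $re ? "matches" : "doesn't match");
```

The undecoded string is ten characters long from Perl's point of view, so a pattern made of three wide codepoints has nothing to match against.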

Let's add the :utf8 PerlIO-layer:

$ cat test-utf8.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<:utf8", shift @ARGV;

my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test-utf8.pl test.xml
Wide character in say at test-utf8.pl line 5, <$_[...]> line 1.
<xml>V日한ế</xml>
 matches
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{65e5}\x{d55c}\x{1ebf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

Now it matches, because we have read the correctly decoded codepoints from the file.
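
The "Wide character" warning in that run appears because STDOUT has no encoding layer. A possible variant, assuming your terminal expects UTF-8, puts an :encoding(UTF-8) layer on both the input and the output. To stay self-contained, this sketch recreates test.xml as a temporary file, and the pattern is written with escapes instead of the literal characters:

```perl
use strict; use warnings; use autodie; use feature 'say';
use File::Temp qw(tempfile);

# Recreate test.xml as a temporary file containing the raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>\n";
close $tmp;

# Encode everything sent to STDOUT as UTF-8, so printing wide
# characters no longer triggers the "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';

# :encoding(UTF-8) decodes like :utf8, but also checks that the input is valid.
open my $fh, '<:encoding(UTF-8)', $path;
my $s = <$fh>;
close $fh;

# Under "use utf8" this pattern is the same as m/V日한ế/.
say "$s ", $s =~ m/V\x{65e5}\x{d55c}\x{1ebf}/ ? "matches" : "doesn't match";
```

Decoding on input and encoding on output keeps everything in between working on codepoints.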

Do you get the same output? If not, which perl/OS combination are you using? (This is perl 5.18.1 on Ubuntu GNU/Linux.)
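
If the string comes out of a library function rather than from a file handle you open yourself, the same fix can be applied by decoding explicitly. This is only a sketch: library_output is a hypothetical stand-in for the function from the question, and it assumes the function really returns UTF-8 encoded bytes, which is what the re 'debug' output suggests:

```perl
use strict; use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the library function from the question:
# it returns raw UTF-8 bytes, not decoded characters.
sub library_output { return "\n<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>" }

my $raw = library_output();

# FB_CROAK makes decode die on malformed input instead of silently
# substituting U+FFFD; LEAVE_SRC keeps $raw unmodified.
my $string = decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# "V日한ế" written as codepoints.
my $needle = "V\x{65e5}\x{d55c}\x{1ebf}";

print "regex: ", ($string =~ /\Q$needle\E/ ? "found" : "not found"), "\n";
print "index: ", index($string, $needle), "\n";
```

Both the regex and index then see the same codepoints that use utf8 produces for the literal in your source file.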

One issue remains with this code: there are multiple ways to represent ế (as a single precomposed codepoint, or as a base letter followed by combining marks). You should therefore normalize both the string in the regex and your input:

use Unicode::Normalize 'NFC';
my $regex_body = NFC "V日한ế";
my $s          = NFC scalar <$fh>;

... m/\Q$regex_body/ ...
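
To illustrate why the normalization matters, this small sketch compares the precomposed and the fully decomposed form of ế. The variable names are illustrative only:

```perl
use strict; use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{1ebf}";          # one codepoint
my $decomposed  = "e\x{302}\x{301}";   # e + combining circumflex + combining acute

# The two strings render identically but are different codepoint sequences.
print "raw equal: ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";

# After normalizing both sides to the same form they compare equal.
print "NFC equal: ", (NFC($precomposed) eq NFC($decomposed) ? "yes" : "no"), "\n";

# NFD splits the precomposed character back into its parts.
printf "NFD length: %d\n", length(NFD($precomposed));
```

Without the NFC calls, input that happens to arrive in decomposed form would fail to match a precomposed pattern even though it is decoded correctly.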

Tags:

string

regex

character-encoding

perl

Nate Glenn

1 Answers

amon
