One of the functions in a library I'm writing returns a string that causes problems when I try to locate Unicode characters in it with either a regex or the index function. The string prints normally (using Sublime Text's console for Unicode printing), like this:
<xml>V日한ế</xml>
And I'm trying to match it like this: $string =~ m/V日한ế/. I'm using utf8.
I apologize that I am unable to produce a minimal breaking example: when I construct the string myself and try to match it, everything works just fine. I tried using the hexdump function from this site, but it prints the same hex sequence for the Unicode characters in the string returned by the library and in the string that I construct ($string2 = 'V日한ế'): 56 e6 97 a5 ed 95 9c e1 ba bf. The string from the library has the UTF-8 flag turned off and the constructed one has it on, but another test showed me that the flag wasn't the problem.
I only have one clue as to the source of the problem: the output with use re 'debug';. It gives the following message:
Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"...
It is printing the character "日" in the regex as %x{65e5} and the same character in the problematic string as %x{e6}%x{97}%x{a5}. The other Unicode characters are similarly printed differently.
Can anyone with experience debugging strings and encodings tell me why regex and index can't find the Unicode characters I know to be present in my string, and how I can make these functions find them?
Let's make a reproducible test case:
Generating an input file:
$ perl -E'say "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>"' >test.xml
$ cat test.xml
<xml>V日한ế</xml>
This writes some bytes to a file. Note that my terminal emulator uses UTF-8.
Trying to naively match the input:
$ cat test.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<", shift @ARGV;
my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test.pl test.xml
<xml>V日한ế</xml>
doesn't match
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{e6}\x{97}\x{a5}\x{ed}\x{95}\x{9c}\x{e1}\x{ba}\x{bf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}
Oh, so the string from the file is seen as a string of bytes, not properly decoded codepoints. Who would have guessed?
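Since the string in the question comes back from a library function rather than from a file handle, the same decoding step can also be done in memory. The following is a minimal sketch, assuming the library returns raw UTF-8 bytes; the variable names ($bytes, $chars) are made up for illustration, and the byte string is hard-coded as a stand-in for the library's return value.

```perl
use strict; use warnings; use utf8; use feature 'say';
use Encode qw(decode);

# Stand-in for the library's return value (assumption): raw UTF-8 bytes.
my $bytes = "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>";

# Match against the undecoded bytes first: the pattern holds code points,
# the string holds bytes, so nothing lines up.
my $before = $bytes =~ m/V日한ế/ ? "matches" : "doesn't match";

# Decode once at the boundary. FB_CROAK dies on malformed input;
# LEAVE_SRC keeps decode() from modifying $bytes in place.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

my $after = $chars =~ m/V日한ế/ ? "matches" : "doesn't match";

say "bytes (length ", length($bytes), "): $before";
say "chars (length ", length($chars), "): $after";
```

The lengths make the difference visible: the byte string counts each UTF-8 byte separately, while the decoded string counts one element per code point.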
Let's add the :utf8 PerlIO layer:
$ cat test-utf8.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<:utf8", shift @ARGV;
my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test-utf8.pl test.xml
Wide character in say at test-utf8.pl line 5, <$_[...]> line 1.
<xml>V日한ế</xml>
matches
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{65e5}\x{d55c}\x{1ebf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}
Now it matches, because we have read the correctly decoded codepoints from the file.
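The "Wide character in say" warning in that run appears because the decoded string contains code points above 255 while STDOUT has no encoding layer. A small sketch of one way to address it, assuming the terminal expects UTF-8 as in the session above:

```perl
use strict; use warnings; use feature 'say';

# Assumption: output goes to a UTF-8 terminal or file.
# Encode decoded strings as UTF-8 on their way out.
binmode STDOUT, ':encoding(UTF-8)';

# Already-decoded code points, written with escapes so the example
# does not depend on the source file's own encoding.
my $s = "<xml>V\x{65e5}\x{d55c}\x{1ebf}</xml>";

say $s;   # no "Wide character" warning with the layer in place
```

The same :encoding(UTF-8) layer can be used on the input side in place of :utf8; it validates the incoming bytes more strictly.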
Do you get the same output? If you don't get comparable output, which perl/OS combination are you using? (This is perl 5.18.1 on Ubuntu GNU/Linux.)
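There are some remaining issues with this code: there are multiple ways to represent ế (for example as a single precomposed code point, or as a base letter followed by combining marks). You should therefore normalize both the string in the regex and your input: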
use Unicode::Normalize 'NFC';
my $regex_body = NFC "V日한ế";
my $s = NFC scalar <$fh>;
... m/\Q$regex_body/ ...
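To illustrate why normalization matters, here is a short sketch comparing two representations of ế. The variable names are illustrative only; NFC is the function from Unicode::Normalize used above.

```perl
use strict; use warnings; use feature 'say';
use Unicode::Normalize qw(NFC);

my $composed   = "\x{1ebf}";          # ế as one precomposed code point
my $decomposed = "e\x{302}\x{301}";   # e + combining circumflex + combining acute

# Compared as-is, the two strings differ even though they render the same.
my $raw_equal = $composed eq $decomposed ? "equal" : "different";

# After normalizing both sides to NFC they compare equal.
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? "equal" : "different";

say "raw: $raw_equal";
say "NFC: $nfc_equal";
say "lengths: ", length($composed), " vs ", length($decomposed);
```

Normalizing both the pattern and the input to the same form means a match does not depend on which representation the producer of the data happened to choose.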