Perl regular expression matching on large Unicode code points

I am trying to replace various characters with either a single quote or double quote.

Here is my test file:

# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal

# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick

I'm trying to do this...

perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt

But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.

Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.

$ awk -F\  '{print $1}' test.txt | \
    perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'

U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "

U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `

Why isn't my regular expression matching?

David Chan asked Oct 01 '12 20:10



2 Answers

It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.

You need:

$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt

I do mention this sort of thing in this answer a couple of times.

tchrist answered Sep 17 '22 13:09



With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.

With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.

You did not decode your input, so your input consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and, of course, encode your output). Using -CSD will likely do this.

perl -CSD -i -pe'
   s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
   s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' test.txt

ikegami answered Sep 18 '22 13:09
