I am trying to replace various characters with either a single quote or double quote.
Here is my test file:
# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal
# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick
I'm trying to do this...
perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt
But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.
Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.
$ awk -F' ' '{print $1}' test.txt | \
perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'
U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "
U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `
Why isn't my regular expression matching?
It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.
You need:
$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt
I do mention this sort of thing in this answer a couple of times.
With use utf8; (which is what -Mutf8 loads), you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.
With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.
You did not decode your input, so it consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and, of course, encode your output). Using -CSD will likely do this.
perl -CSD -i -pe'
s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' test.txt
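A quick way to check the command before running it on real data is to try it on a scratch file. This is only a sketch: sample.txt is a hypothetical file name, and its contents (a few curly quotes and a backtick written as raw UTF-8 bytes) are invented for the test.

```shell
# sample.txt is a hypothetical scratch file used only for this check.
# It contains  “a” ‘b’ `c`  written as raw UTF-8 bytes.
printf '\342\200\234a\342\200\235 \342\200\230b\342\200\231 `c`\n' > sample.txt

# Same two substitutions as above, applied in place with decoding enabled.
perl -CSD -i -pe '
   s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
   s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' sample.txt

cat sample.txt   # prints "a" 'b' 'c'
```

Since -i without a suffix keeps no backup, adding one (for example -i.orig, as in the first answer) is the safer choice on files you care about.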