Perl regular expression matching on large Unicode code points

I am trying to replace various characters with either a single quote or double quote.

Here is my test file:

# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal

# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick

I'm trying to do this...

perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt

But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.

Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.

$ awk -F\  '{print $1}' test.txt | \
    perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'

U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "

U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `

Why isn't my regular expression matching?

asked Oct 01 '12 20:10

David Chan

2 Answers

It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.

You need:

$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt

I do mention this sort of thing in this answer a couple of times.

answered Sep 17 '22 13:09

tchrist

With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.

With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.

You did not decode your input, so your input consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and, of course, encode your output). Using -CSD will likely do this.

perl -CSD -i -pe'
   s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
   s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' test.txt

answered Sep 18 '22 13:09

ikegami


Tags: regex, encoding, unicode, perl
