
Perl command line replace for unicode

I am trying to replace every word (stored in a temporary file called _id) with a number using a shell script. It works fine except for Unicode words: a number is generated for them, but the replacement using Perl does not happen. The bash code in question is below:

x=0
for id in `cat _id`; do
    echo $x $id
    perl -p -i -e "s/\b$id\b/$x/g" x_graph.dot
    x=$(($x + 1))
done 

Can someone please point out where the bug is?

xinit asked Dec 24 '22 15:12


2 Answers

Let's say you have é (U+00E9) encoded using UTF-8: C3 A9. Since you don't do any decoding, you obtain the string that's produced by "\xC3\xA9".

Regular expressions (or rather \b, \w, \d, etc.) expect the input to be Unicode Code Points, which means you are effectively providing U+00C3 and U+00A9 instead of U+00E9. U+00C3 is a word character, but U+00A9 isn't, so the second \b doesn't match where it's expected to match.

So you need to decode your inputs and encode your outputs. -C provides a convenient way of doing that for UTF-8.

perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

Notes:

  • By passing the id and the number as command-line arguments instead of interpolating them into the Perl source, I fixed a code injection bug.

  • The use of \b assumes that $id will always start with a \w char and always end with a \w char, so I added a check to verify that assumption.

  • By using \Q..\E to convert the id into a regex pattern, I fixed a regex injection bug: metacharacters in the id are now matched literally.


Test:

$ printf "é\n" >_id

$ printf "[é]\n" >x_graph.dot

$ x=0

$ id=`cat _id`

$ perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

$ cat x_graph.dot
[0]
ikegami answered Jan 10 '23 14:01


See perldoc perlrun:

-C [number/list]

The -C flag controls some of the Perl Unicode features:

I     1   STDIN is assumed to be in UTF-8
O     2   STDOUT will be in UTF-8
E     4   STDERR will be in UTF-8
S     7   I + O + E
i     8   UTF-8 is the default PerlIO layer for input streams
o    16   UTF-8 is the default PerlIO layer for output streams
D    24   i + o
A    32   the @ARGV elements are expected to be strings encoded
          in UTF-8

So, at the very least, you'd want perl -COi, but perl -CSD looks tidier.

In addition, you may want to use the /u modifier, which perlre describes as

u match according to Unicode rules

with your s///. Or, write:

perl -CSD -Mutf8 -Mfeature=unicode_strings -p -i -e "s/\b$id\b/$x/g" x_graph.dot

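Note that this command uses double quotation marks, so the shell interpolates $id and $x into the Perl code before Perl runs. With single quotation marks, $id and $x would instead be parsed as (unset) Perl variables.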

Sinan Ünür answered Jan 10 '23 12:01