Perl command line replace for unicode

Question

I am trying to replace every word (stored in a temporary file called _id) with a number using a shell script. It works fine except for Unicode words: a number is generated for them, but the replacement done by Perl does not take effect. The bash code in question is below:

x=0
for id in `cat _id`; do
    echo $x $id
    perl -p -i -e "s/\b$id\b/$x/g" x_graph.dot
    x=$(($x + 1))
done

Can someone please point out where the bug is?

ikegami · Accepted Answer

Let's say you have é (U+00E9) encoded using UTF-8: C3 A9. Since you don't do any decoding, you obtain the string that's produced by "\xC3\xA9".

Regular expressions (or rather \b, \w, \d, etc.) expect the input to be a string of Unicode Code Points, which means you are effectively providing U+00C3 and U+00A9 instead of U+00E9. U+00C3 is a word character, but U+00A9 isn't, so the second \b doesn't match where it's expected to match.

So you need to decode your inputs and encode your outputs. -C provides a convenient way of doing that for UTF-8.

perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

Notes:

By using command-line arguments to pass the id and the number, I fixed a code injection bug: the shell no longer interpolates them into the Perl source.
The use of \b assumes that $id will always start with a \w char and always end with a \w char, so I added a check to verify that assumption.
By using \Q..\E to convert the id into a regex pattern, I fixed a regex injection bug: metacharacters in the id are now matched literally.

Test:

$ printf "é\n" >_id

$ printf "[é]\n" >x_graph.dot

$ x=0

$ id=`cat _id`

$ perl -i -CSDA -pe'
   BEGIN {
      ($id, $x) = splice(@ARGV, 0, 2);
      die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
   }

   s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot

$ cat x_graph.dot
[0]

Sinan Ünür · Answer

See perldoc perlrun:

`-C` [number/list]

The -C flag controls some of the Perl Unicode features:

I     1   STDIN is assumed to be in UTF-8
O     2   STDOUT will be in UTF-8
E     4   STDERR will be in UTF-8
S     7   I + O + E
i     8   UTF-8 is the default PerlIO layer for input streams
o    16   UTF-8 is the default PerlIO layer for output streams
D    24   i + o
A    32   the @ARGV elements are expected to be strings encoded
          in UTF-8

So, at the very least, you'd want perl -COi, but perl -CSD looks tidier.

In addition, you may want to use

/u   match according to Unicode rules

with your s///. Or, write:

perl -CSD -Mutf8 -Mfeature=unicode_strings -p -i -e "s/\b$id\b/$x/g" x_graph.dot

Note the use of single quotation marks instead of double so as to avoid unintended interpolation.

Tags: regex, bash, unicode, perl, xinit