I am trying to replace every word (stored in a tmp file called _id) with a number using a shell script. It works fine except for Unicode words, for which a number is generated but the replacement using Perl does not work. The bash code in question is below:
x=0
for id in `cat _id`; do
echo $x $id
perl -p -i -e "s/\b$id\b/$x/g" x_graph.dot
x=$(($x + 1))
done
Can someone please point out where the bug is?
Let's say you have é (U+00E9) encoded using UTF-8: C3 A9. Since you don't do any decoding, you obtain the string that's produced by "\xC3\xA9".
Regular expressions (or rather \b, \w, \d, etc.) expect the input to be Unicode Code Points, which means you are effectively providing U+00C3 and U+00A9 instead of U+00E9. U+00C3 is a word character, but U+00A9 isn't, so the second \b doesn't match where it's expected to match.
So you need to decode your inputs and encode your outputs. -C provides a convenient way of doing that for UTF-8.
perl -i -CSDA -pe'
BEGIN {
($id, $x) = splice(@ARGV, 0, 2);
die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
}
s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot
Notes:
By using command-line arguments to pass the values, I fixed a code injection bug (the shell no longer interpolates $id into the Perl source).
The use of \b assumes that $id will always start with a \w char and always end with a \w char, so I added a check to verify that assumption.
By using \Q..\E to convert the id into a regex pattern, I fixed a regex injection bug (special characters in the id are now matched literally).
Test:
$ printf "é\n" >_id
$ printf "[é]\n" >x_graph.dot
$ x=0
$ id=`cat _id`
$ perl -i -CSDA -pe'
BEGIN {
($id, $x) = splice(@ARGV, 0, 2);
die "Bad id" if $id !~ /^\w(?:.*\w)?\z/s;
}
s/\b\Q$id\E\b/$x/g
' "$id" "$x" x_graph.dot
$ cat x_graph.dot
[0]
See perldoc perlrun:
-C [number/list]
The -C flag controls some of the Perl Unicode features:
I     1   STDIN is assumed to be in UTF-8
O     2   STDOUT will be in UTF-8
E     4   STDERR will be in UTF-8
S     7   I + O + E
i     8   UTF-8 is the default PerlIO layer for input streams
o    16   UTF-8 is the default PerlIO layer for output streams
D    24   i + o
A    32   the @ARGV elements are expected to be strings encoded in UTF-8
So, at the very least, you'd want perl -COi, but perl -CSD looks tidier.
In addition, you may want to use the /u modifier ("match according to Unicode rules") with your s///. Or, write:
perl -CSD -Mutf8 -Mfeature=unicode_strings -p -i -e "s/\b$id\b/$x/g" x_graph.dot
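Note that the double quotation marks let the shell interpolate $id and $x into the Perl code, which -Mutf8 then treats as UTF-8 source. Single quotation marks would avoid unintended interpolation, but the values would then have to reach Perl some other way, for example as command-line arguments.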