I'd like to transliterate curly single and double quotes to neutral quotes in a document. I thought it should be as simple as perl -pe 'tr/“”’/""\047/', but that doesn't work. For example:
snafu$ echo '“' | perl -pe 'tr/“”’/""\047/'
""'
snafu$ echo '“”’' | perl -pe 'tr/“”’/""\047/'
""'""'""'
Notice that a single “ becomes the full set of characters on the right. In the second example, it happens three times.
And, even less expected (for me) is that the triple occurs even for this trivial case:
snafu$ echo '“' | perl -pe 'tr/“/"/'
"""
This behavior seems to be very different from what I see with ASCII characters, like this:
snafu$ echo "Larry Wall" | perl -pe 'tr/ay/AY/'
LArrY WAll
I have also tried invocations with perl -Mutf8, but that also didn't do what I expected:
# not triplicated, but also not transliterated
snafu$ echo '“' | perl -Mutf8 -pe 'tr/“”’/""\047/'
“
What explains the above behaviors of tr///?
You want the following:
perl -CS -Mutf8 -pe 'tr/“”’/""\047/'
Without use utf8;, Perl expects the source code to be encoded using ASCII. So your first snippet can't possibly contain “, ” and ’. Since string literals are "8-bit clean", your first snippet is equivalent to
tr/\xE2\x80\x9C\xE2\x80\x9D\xE2\x80\x99/""'/
That's obviously incorrect. To fix this, add use utf8; like you did in your last snippet.
So why doesn't the last snippet work? That's because it's effectively doing
"\xE2\x80\x9C" =~ tr/\x{201C}\x{201D}\x{2019}/""'/;
That's also obviously incorrect. You're searching encoded text (a string of UTF-8 bytes) for decoded text (a string of Unicode Code Points). You need to decode your inputs and encode your outputs. This can be achieved using use open ":std", ":encoding(UTF-8)";, but -CS can be used here.
Finally, there's
echo '“' | perl -pe 'tr/“/"/'
From the above, we know it's equivalent to the following:
"\xE2\x80\x9C" =~ tr/\xE2\x80\x9C/"/;
Unless you use /d, tr/// repeats the last character if there are fewer characters on the right than on the left. This makes the above equivalent to
"\xE2\x80\x9C" =~ tr/\xE2\x80\x9C/"""/;
And that explains the """ output.
To explain your less complex example
snafu$ echo '“' | perl -pe 'tr/“/"/'
"""
The UTF-8 for “ is the sequence e2 80 9c. Because everything is treated as ASCII characters (bytes), your translation command will replace each of these bytes with ". That's why you are getting three double quotes.
A similar thing happens in your first example. The search list is nine bytes long and the replacement list only three, so tr pads the replacement by repeating its last character, and when a byte appears more than once in the search list only its first mapping is used. The first two UTF-8 bytes of all your characters (“”’) are identical (e2 80), so they map to the first two characters of the replacement list. The third byte of “ (9c) maps to the third character. The third bytes of the other two (9d and 99) fall past the end of the replacement list, so they are replaced with its last character. You can see this more clearly if you add a fourth character to your replacement string. For example, tr/“”’/""\047z/ will output ""'""z""z if the input is “”’.
Your code is not wrong. If you write a script into a file and properly use utf8 and binmode, it will work as expected:
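use utf8;                            # the source contains literal UTF-8 characters
binmode STDIN,  ":encoding(UTF-8)";  # decode input
binmode STDOUT, ":encoding(UTF-8)";  # encode output
my $s = <STDIN>;
$s =~ tr/“”’/""'/;
print $s;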
Output:
""'
So what you need is to tell Perl to treat STDIN as UTF-8 from the command line. You can do that with -C1 or the more common option -CS, which will treat STDIN, STDOUT, and STDERR as UTF-8.
echo '“”’' | perl -Mutf8 -CS -pe 'tr/“”’/""\047/'