How can I handle utf8 using Perl (or Python) on the command line?
For example, I am trying to split each word into its characters. This is very easy for non-UTF-8 text:
$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c d e f
But with utf8 it doesn't work, of course:
$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5> <D0> <B7> <D0> <B0>
because it doesn't know about the 2-byte characters.
It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)
$octets = encode(ENCODING, $string [, CHECK])
Encodes a string from Perl's internal form into ENCODING and returns a sequence of octets. ENCODING can be either a canonical name or an alias. For encoding names and aliases, see "Defining Aliases"; for CHECK, see "Handling Malformed Data".
UTF-8 (Unicode Transformation Format, 8-bit) is an encoding specified in ISO/IEC 10646, the Unicode Standard, and RFC 3629. Its four-byte form carries 21 payload bits, so it could in principle address 2,097,152 values (2^21), more than enough to cover the 1,112,064 valid Unicode code points (U+0000 to U+10FFFF, excluding the surrogates).
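To make these numbers concrete, here is a small Python 3 sketch (not part of the original posts) that measures how many bytes UTF-8 uses for a few sample characters and recomputes the code point count quoted above. The helper name `utf8_length` and the sample characters are illustrative choices, not anything from the question.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs to encode the string ch."""
    return len(ch.encode("utf-8"))

# Illustrative samples, one per UTF-8 length class:
# U+0061 (Latin a), U+043E (Cyrillic o), U+20AC (euro sign), U+1F600 (emoji).
SAMPLES = ["\u0061", "\u043e", "\u20ac", "\U0001F600"]

# Code points run from U+0000 to U+10FFFF; the 2,048 surrogates
# (U+D800..U+DFFF) are excluded because UTF-8 must not encode them.
TOTAL_CODE_POINTS = 0x10FFFF + 1
SURROGATES = 0xDFFF - 0xD800 + 1
SCALAR_VALUES = TOTAL_CODE_POINTS - SURROGATES

if __name__ == "__main__":
    for ch in SAMPLES:
        print("U+%04X takes %d byte(s)" % (ord(ch), utf8_length(ch)))
    print("valid code points:", SCALAR_VALUES)
    print("4-byte payload capacity:", 2 ** 21)
```

The Cyrillic letters in the question fall in the two-byte class, which is why a byte-oriented regex splits each of them into two pieces.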
The "-C" flag controls some of the Perl Unicode features (see perldoc perlrun):
$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е з а
To specify the encoding used for stdin/stdout, you can set the PYTHONIOENCODING environment variable (the example is Python 2, where the bytes read from stdin must be decoded explicitly):
$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е з а
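For Python 3, where the `print` statement and `str.decode` no longer exist, a rough equivalent might look like the sketch below. It assumes Python 3.7 or later (for `reconfigure`), and the names `space_out` and `main` are made up for this example. Setting PYTHONIOENCODING=utf-8 in the environment should work in place of the `reconfigure` calls.

```python
import sys

def space_out(line):
    """Put a single space between every code point of one line."""
    # Drop the trailing newline first so it is not joined like a character.
    return " ".join(line.rstrip("\n"))

def main():
    # Force UTF-8 on both streams regardless of the locale (Python 3.7+).
    sys.stdin.reconfigure(encoding="utf-8")
    sys.stdout.reconfigure(encoding="utf-8")
    for line in sys.stdin:
        sys.stdout.write(space_out(line) + "\n")

if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as `split_chars.py`, it would be run as `echo "одобрение за" | python3 split_chars.py`. Note that the space between the two words is itself a character, so it ends up padded to three spaces in the output.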
If you'd like to split the text on character (grapheme) boundaries rather than on code points, as the code above does, you could use the /\X/ regular expression:
$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е з а
See Grapheme Cluster Boundaries (Unicode Standard Annex #29).
In Python, \X is supported by the third-party regex module.
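To illustrate the difference between code points and graphemes without installing anything, here is a rough standard-library sketch. It is only an approximation: it attaches combining marks to the preceding base character and does not implement the full grapheme cluster rules (no emoji ZWJ sequences, Hangul jamo, and so on). The function name `approx_graphemes` is invented for this example. With the regex module installed, `regex.findall(r"\X", text)` would be the more faithful route.

```python
import unicodedata

def approx_graphemes(text):
    """Group each combining mark with the character before it.

    This is a simplification, not the full Unicode segmentation algorithm.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

if __name__ == "__main__":
    # Cyrillic "и" followed by U+0306 COMBINING BREVE renders as "й".
    word = "\u0438\u0306"
    print(len(list(word)))              # number of code points
    print(len(approx_graphemes(word)))  # number of approximate graphemes
```

Splitting this string with `.` (or with `list()`) yields two pieces, while splitting on grapheme boundaries keeps the letter and its accent together.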