How to handle utf8 on the command line (using Perl or Python)?

Tags: python, utf-8, perl

How can I handle utf8 using Perl (or Python) on the command line?

For example, I am trying to split each word into its characters. This is very easy for single-byte (ASCII) text:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with UTF-8 input it doesn't work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because by default Perl treats the input as a sequence of bytes, so it doesn't know about the 2-byte characters and splits each one in half.
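The byte-versus-character mismatch can be reproduced in a few lines of Python 3. This is only an illustrative sketch: the helper name hex_bytes is made up for this example, and the snippet assumes the text is encoded as UTF-8, as in the transcript above.

```python
# Illustrative sketch (Python 3). hex_bytes is a hypothetical helper name.
def hex_bytes(data):
    """Format bytes the way less displays non-printable bytes, e.g. <D0> <BE>."""
    return " ".join("<%02X>" % b for b in data)

text = "одобрение за"
data = text.encode("utf-8")  # assumption: the terminal sends UTF-8

print(len(text))             # 12 code points
print(len(data))             # 23 bytes: each Cyrillic letter takes 2 bytes
print(hex_bytes(data[:4]))   # <D0> <BE> <D0> <B4>
```

Matching one byte at a time, as the Perl one-liner does without any Unicode settings, yields the 23 individual bytes rather than the 12 characters.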

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.


asked Mar 16 '12 01:03

Frank

1 Answer

The "-C" flag controls some of Perl's Unicode features, including whether the standard streams are treated as UTF-8 (see perldoc perlrun):

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а

In Python, you could specify the encoding used for stdin/stdout with the PYTHONIOENCODING environment variable (the example below uses Python 2 syntax):

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а
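In Python 3, sys.stdin already yields decoded text, so no explicit decode call is needed. A minimal sketch of the same idea follows; the function name split_chars is chosen here for illustration and is not part of the original answer.

```python
# Minimal Python 3 sketch. split_chars is a hypothetical helper name.
def split_chars(line):
    """Return the code points of one line, separated by single spaces."""
    return " ".join(line.rstrip("\n"))

# Demonstration on a literal string instead of real standard input.
print(split_chars("одобрение за\n"))  # о д о б р е н и е   з а
```

On the command line, the same function could be applied to each line of sys.stdin. Setting PYTHONIOENCODING=utf-8 should only matter when the locale does not already select UTF-8.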

If you'd like to split the text on character (grapheme cluster) boundaries rather than on code points, as the code above does, you could use the \X regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а

See Grapheme Cluster Boundaries in Unicode Standard Annex #29 (Unicode Text Segmentation).

In Python, \X is supported by the third-party regex module.
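When installing regex is not an option, a rough approximation is possible with the standard library alone. The sketch below only attaches combining marks to the preceding base character. It does not implement the full grapheme cluster rules (for example, emoji sequences or Hangul syllables), and the name split_clusters is made up for this example.

```python
import unicodedata

# Simplified approximation, NOT the full UAX #29 algorithm:
# a code point with a non-zero combining class joins the previous cluster.
def split_clusters(text):
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # combining mark: extend the last cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters

word = "и\u0306"                 # и followed by COMBINING BREVE, displayed like й
print(len(word))                 # 2 code points
print(len(split_clusters(word))) # 1 cluster
```

For text without combining marks, such as the example in the question, this gives the same result as splitting on code points.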


answered Sep 19 '22 06:09

jfs
