How do I tell Python that sys.argv is in Unicode?

Here is a little program:

import sys

f = sys.argv[1]
print type(f)
print u"f=%s" % (f)

Here is my running of the program:

$ python x.py 'Recent/רשימת משתתפים.LNK'
<type 'str'>
Traceback (most recent call last):
  File "x.py", line 5, in <module>
    print u"f=%s" % (f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 7: ordinal not in range(128)
$ 

The problem is that Python treats sys.argv[1] as an ASCII string, which it can't convert to Unicode. But I'm using a Mac with a fully Unicode-aware Terminal, so x.py is actually receiving Unicode text. How do I tell Python that sys.argv[] is Unicode and not ASCII? Failing that, how do I convert an ASCII string (that has Unicode inside it) into Unicode? The obvious conversions don't work.

vy32 asked Feb 25 '11 04:02


1 Answer

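The UnicodeDecodeError you see comes from mixing the Unicode string u"f=%s" with the bytestring sys.argv[1]. Python 2 tries to decode the bytestring implicitly with the default 'ascii' codec, which fails on the first non-ASCII byte. Make both operands the same type: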

  • both bytestrings:

      $ python2 -c'import sys; print "f=%s" % (sys.argv[1],)' 'Recent/רשימת משתתפים'
    

    This passes bytes transparently from/to your terminal. It works for any encoding.

  • both Unicode:

      $ python2 -c'import sys; print u"f=%s" % (sys.argv[1].decode("utf-8"),)' 'Rec..
    

    Here you should replace 'utf-8' with the encoding your terminal uses. You might use sys.getfilesystemencoding() here if the terminal is not Unicode-aware.

Both commands produce the same output:

f=Recent/רשימת משתתפים
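For reference, the failure from the question can be reproduced explicitly. The sketch below performs by hand the 'ascii' decode that Python 2 applies implicitly when a bytestring is formatted into a Unicode string. The byte values are an assumption: they are the UTF-8 encoding of "Recent/" plus the first Hebrew letter of the example argument, as a UTF-8 terminal would pass them.

```python
# Assumed input: "Recent/" followed by the UTF-8 bytes of the Hebrew
# letter resh (U+05E8), i.e. the start of the argument in the question.
raw = b"Recent/\xd7\xa8"

try:
    # Python 2 does this decode implicitly in  u"f=%s" % raw
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("utf-8")
print(text == u"Recent/\u05e8")
```

The reported position, 7, is the index of the first byte after the seven ASCII characters of "Recent/".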

In general you should convert bytestrings that you consider to be text to Unicode as soon as possible.
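One way to follow that advice is to decode every argument once, at the edge of the program, so the rest of the code only ever sees Unicode. This is a minimal sketch, not a definitive implementation: to_text is a hypothetical helper name, and the 'utf-8' default is an assumption that should be replaced with the encoding your terminal really uses.

```python
import sys

def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string (hypothetical helper).

    Bytestrings are decoded with `encoding` (assumed to be the terminal's
    encoding); values that are already Unicode are returned unchanged.
    """
    if isinstance(value, bytes):  # Python 2 str, or Python 3 bytes
        return value.decode(encoding)
    return value

# Decode all command-line arguments once, before any other processing.
args = [to_text(arg) for arg in sys.argv[1:]]
```

On Python 3, sys.argv already holds Unicode strings, so the helper returns each argument unchanged there.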

jfs answered Sep 30 '22 11:09