Suppose I have a mysterious Unicode string in Python (2.7) that I want to feed to a command-line program such as ImageMagick (or really just get out of Python in any way). The strings might be:
So in Python I might make a little command like this:
cmd = u'convert -pointsize 24 label:"%s" "%s.png"' % (name, name)
If I just print cmd and get
convert -pointsize 24 label:"Jörgen Jönsson" "Jörgen Jönsson.png"
and then run it myself, everything is fine.
But if I do os.system(cmd), I get this:
I know it's not an ImageMagick problem because the filenames are messed up too. I know that Python is converting the command to ASCII when it passes it off to os.system, but why is it getting the encoding so wrong? Why is it interpreting each non-ASCII character as two characters? According to a few articles I've read, it might be because it's encoded as Latin-1 but read as UTF-8, but I've tried encoding it back and forth between them and that isn't helping.
I get Unicode exceptions when I try to just encode it manually as ascii without a replacement argument, but if I do name.encode('ascii','xmlcharrefreplace'), I get the following:
I'm hoping that someone recognizes this particular kind of encoding problem and can offer some advice, because I'm about out of ideas.
Thanks!
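One common way for each non-ASCII character to turn into two is for the text to be written out as UTF-8 bytes and then read back one byte per character (for example as Latin-1). The snippet below is a minimal sketch of that effect using the name from the question. Whether this is exactly what happens on the asker's machine is an assumption, since the garbled output is not shown.

```python
from __future__ import print_function  # keeps print() identical on 2.7 and 3

# The example name from the question, written with escapes so the
# snippet does not depend on the source file's encoding.
name = u'J\xf6rgen J\xf6nsson'

# UTF-8 stores the single character U+00F6 as the two bytes 0xC3 0xB6.
utf8_bytes = name.encode('utf-8')

# Reading those bytes as Latin-1 maps every byte to one character,
# so each non-ASCII character shows up as two.
misread = utf8_bytes.decode('latin-1')

print(len(name), len(utf8_bytes), len(misread))
print(repr(misread))

# Undoing the wrong decode recovers the original text.
recovered = misread.encode('latin-1').decode('utf-8')
print(recovered == name)
```

If the doubled characters you see match `misread`, the bytes are probably fine and only the reader's assumed encoding is wrong.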
Operating systems that allow Unicode filenames usually implement this by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well.
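To see which encoding Python would use for that conversion on a given machine, the standard library exposes it directly. This is a small sketch, and the values printed depend entirely on the platform and locale, so no expected output is shown.

```python
from __future__ import print_function

import locale
import sys

# Encoding Python uses to convert Unicode filenames to bytes for the OS.
fs_encoding = sys.getfilesystemencoding()

# Encoding the current locale prefers for text data in general.
preferred = locale.getpreferredencoding()

print("filesystem encoding:", fs_encoding)
print("preferred encoding:", preferred)
```

If the two differ, or differ from what the receiving program expects, that mismatch is a likely place for garbled names to come from.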
In Python 3, the default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal (Python 2 defaults to ASCII and needs an encoding declaration at the top of the file):
try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")
Side note: Python 3 also supports using Unicode characters in identifiers:
In Python 2, normal strings (str) are sequences of 8-bit bytes, while Unicode strings (unicode) hold Unicode code points, stored internally in 16-bit or 32-bit units depending on how the interpreter was built. This allows for a more varied set of characters, including special characters from most languages in the world. I'll restrict my treatment of Unicode strings to the following:
Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding. One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point.
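Both points can be checked from the interpreter. The sketch below is written for Python 3, where chr() returns a Unicode string; in Python 2.7 the equivalent call is unichr(). It uses codecs.lookup() to resolve each alias to its codec.

```python
import codecs

# The three aliases mentioned above.
aliases = ['latin-1', 'iso_8859_1', '8859']

# codecs.lookup() returns a CodecInfo whose .name is the canonical name.
canonical_names = [codecs.lookup(alias).name for alias in aliases]
print(canonical_names)

# All three aliases resolve to a single codec.
print(len(set(canonical_names)) == 1)

# chr() builds a one-character string from a code point (Python 3).
o_umlaut = chr(0xF6)
print(len(o_umlaut), o_umlaut == u'\xf6')

# ord() is the inverse: it returns the code point of a character.
print(ord(o_umlaut) == 0xF6)
```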
Use subprocess.call instead:
>>> s = u'Jörgen Jönsson'
>>> import subprocess
>>> subprocess.call(['echo', s])
Jörgen Jönsson
0
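Applied to the command in the question, the same idea would look roughly like the sketch below. The helper name build_convert_args is hypothetical, and the actual call is left commented out because it assumes ImageMagick's convert is installed and on the PATH. Passing a list means no shell is involved, so the quotes around the label and filename are no longer needed.

```python
from __future__ import print_function

import subprocess


def build_convert_args(name):
    """Hypothetical helper: the question's command as an argument list."""
    return [
        u'convert',
        u'-pointsize', u'24',
        u'label:%s' % name,   # one list item is one argument, spaces included
        u'%s.png' % name,
    ]


name = u'J\xf6rgen J\xf6nsson'
args = build_convert_args(name)
print(len(args))

# Uncomment to run ImageMagick (assumes `convert` is on the PATH):
# subprocess.call(args)
```

If the program still receives garbled text, encoding each argument explicitly before the call (for example with the encoding reported by sys.getfilesystemencoding()) is one thing to try, though which encoding the receiving program expects depends on the platform.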