Unicode error when outputting python script output to file

Tags:

python

unicode

beautifulsoup

This is the code:

print '"' + title.decode('utf-8', errors='ignore') + '",' \
      ' "' + title.decode('utf-8', errors='ignore') + '", ' \
      '"' + desc.decode('utf-8', errors='ignore') + '")'

title and desc are returned by Beautiful Soup 3 (p[0].text and p[0].prettify) and, as far as I can tell from the BeautifulSoup3 documentation, are UTF-8 encoded.

If I run

python.exe script.py > out.txt

I get the following error:

Traceback (most recent call last):
  File "script.py", line 70, in <module>
    '"' + desc.decode('utf-8', errors='ignore') + '")'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 264: ordinal not in range(128)

However if I run

python.exe script.py

I get no error. It happens only if an output file is specified.

How can I get good UTF-8 data in the output file?

asked Apr 04 '12 19:04

Kaitnieks

3 Answers

You can use the codecs module to write unicode data to the file:

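import codecs
# codecs.open returns a writer that encodes unicode strings to UTF-8
with codecs.open("out.txt", "w", "utf-8") as out:
    out.write(something)  # something: the unicode string to save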

'print' writes to standard output. When stdout is redirected to a file, Python 2 cannot detect an output encoding and falls back to ASCII, so printing a unicode string with non-ASCII characters raises this error.
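As a rough check of the codecs.open approach, the sketch below writes a unicode string to a temporary file and reads the raw bytes back. The helper name write_utf8, the temporary path and the sample text are made up for illustration; they are not from the question.

```python
import codecs
import os
import tempfile


def write_utf8(path, text):
    """Write a unicode string to `path`, encoded as UTF-8 (hypothetical helper)."""
    with codecs.open(path, "w", "utf-8") as out:
        out.write(text)


# Assumed sample data: contains U+00F8, the character from the traceback.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_utf8(path, u'"r\xf8d gr\xf8d"\n')

# Read the file back in binary mode to inspect the bytes actually stored.
with open(path, "rb") as raw:
    data = raw.read()
```

Each U+00F8 ends up as the two-byte UTF-8 sequence C3 B8 in the file, regardless of what the console or a redirected stdout would have done.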

answered Sep 19 '22 03:09

Maksym Polshcha

Windows behaviour in this case is a bit complicated. You should follow the other advice: use unicode strings internally and decode on input.

To answer your question: when stdout is redirected you need to print encoded strings (only you know which encoding!), but for plain screen output you have to print unicode strings (Python or the Windows console then handles the conversion to the proper encoding).

I recommend structuring your script this way:

# -*- coding: utf-8 -*- 
import sys, codecs
# set up output encoding
if not sys.stdout.isatty():
    # here you can set encoding for your 'out.txt' file
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

# next, you will print all strings in unicode
print u"Unicode string ěščřžý"

Update: see also this similar question: Setting the correct encoding when piping stdout in Python
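To see what the codecs.getwriter wrapper does without touching the real stdout, here is a small sketch that wraps an in-memory byte buffer instead. The buffer standing in for the redirected stream and the helper name utf8_writer are assumptions for illustration only.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so unicode written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# io.BytesIO stands in for a redirected stdout (assumption for this demo).
buf = io.BytesIO()
out = utf8_writer(buf)

# Same sample text as in the script above, written with escapes so the
# example does not depend on the source file's own encoding.
out.write(u"Unicode string \u011b\u0161\u010d\u0159\u017e\xfd\n")
encoded = buf.getvalue()
```

The wrapper accepts unicode and hands UTF-8 bytes to the underlying stream, which is what the redirected file needs.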

answered Sep 21 '22 03:09

Jiri

It makes no sense to convert text to unicode in order to print it. Work with your data in unicode, convert it to some encoding for output.

What your code does instead: You're on Python 2, so your default string type (str) is a byte string. In your statement you start with some UTF-8-encoded byte strings, convert them to unicode, and surround them with quotes (regular str objects that are coerced to unicode so everything can be combined into one string). You then pass this unicode string to print, which pushes it to sys.stdout. To do so, it needs to turn it into bytes. If you are writing to the Windows console, it can negotiate an encoding somehow, but if you redirect to a regular dumb file, it falls back on ASCII and complains because there's no lossless way to do that.
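The failing step can be reproduced in isolation. This is a minimal sketch with an assumed sample string, not the asker's data: encoding a unicode string that contains U+00F8 with the ASCII codec raises the same kind of error as in the traceback, while UTF-8 succeeds.

```python
# Assumed sample text containing U+00F8, the character named in the traceback.
text = u"r\xf8d"

message = None
try:
    # This is roughly what happens when the output stream has no encoding
    # and Python 2 falls back to ASCII.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding explicitly to UTF-8 works for any unicode string.
utf8_bytes = text.encode("utf-8")
```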

Solution: Don't give print a unicode string. Encode it yourself to the representation of your choice:

print "Latin-1:", "unicode über alles!".decode('utf-8').encode('latin-1')
print "Utf-8:", "unicode über alles!".decode('utf-8').encode('utf-8')
print "Windows:", "unicode über alles!".decode('utf-8').encode('cp1252')

All of these should work without complaint when you redirect. The output probably won't look right on your screen, but open the output file with Notepad or another editor and check whether it detects the encoding. (UTF-8 is the only one that has a hope of being detected. cp1252 is a likely Windows default.)
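To make the difference between the three encodings concrete, the following sketch compares the bytes each one produces for the same text. The variable names are made up for illustration.

```python
# The same text as in the print statements, written with an escape for U+00FC.
text = u"unicode \xfcber alles!"

latin1_bytes = text.encode("latin-1")  # one byte (0xFC) for the u-umlaut
utf8_bytes = text.encode("utf-8")      # two bytes (0xC3 0xBC) for the u-umlaut
cp1252_bytes = text.encode("cp1252")   # same byte as Latin-1 for this character

# Latin-1 and cp1252 agree here, so an editor cannot tell them apart
# from this file alone; the UTF-8 output is one byte longer.
same_single_byte = latin1_bytes == cp1252_bytes
```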

Once you get that down, clean up your code and avoid using print for file output. Use the codecs module, and open files with codecs.open instead of plain open.

PS. If you're decoding a UTF-8 string, conversion to unicode should be lossless: you don't need the errors='ignore' flag. That flag is appropriate when you convert to ASCII or Latin-2 or whatever, and you want to just drop characters that don't exist in the target code page.
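A short sketch of that distinction, using an assumed sample string: a UTF-8 round trip loses nothing, while encoding to ASCII with errors set to 'ignore' silently drops the characters that do not fit.

```python
# Assumed sample text with two non-ASCII characters (U+00F8).
original = u"r\xf8d gr\xf8d"

# UTF-8 can represent every unicode character, so the round trip is lossless.
utf8_bytes = original.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot represent U+00F8; with 'ignore' those characters are dropped.
ascii_bytes = original.encode("ascii", "ignore")
```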

answered Sep 21 '22 03:09

alexis
