Python, Encoding output to UTF-8

Tags: python, encoding, utf-8, python-2.7

I have a function that builds a string containing UTF-8 encoded characters. The output files are opened with the 'w+', "utf-8" arguments.

However, when I call x.write(string) I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)

I assume this is because you would normally write, for example, print(u'something'). But I need to use a variable, and I can't put a variable inside the quotes of u'_'...

Any suggestions?

EDIT: Actual code here:

source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)

Essentially, all this is supposed to do is build a large number of strings that look similar to this:

[日木曜 Deliverables]= CASE WHEN things = 11 THEN C ELSE 0 END

asked Jul 10 '13 18:07

Razzle Dazzle

2 Answers

Are you using codecs.open()? Python 2.7's built-in open() does not take an encoding argument, meaning you have to manually encode non-ASCII strings (as others have noted). codecs.open() does take one, and would probably be easier to drop in than manually encoding all the strings.

Since your added code shows you are already using codecs.open(), and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which handles the UTF-8 BOM automatically (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would expect that to matter only for the input file. If none of the combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely explanation is that your input file is encoded in a different Unicode format with a BOM. Python's plain UTF-8 codec treats a BOM as a regular character, so reading the input would not fail but writing the output could.
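To illustrate the difference between the two codecs, here is a minimal sketch that writes a small file starting with a UTF-8 BOM and reads it back both ways. The file name and sample text are made up for the example, and the non-ASCII characters are written as escapes so the snippet should run unchanged on Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample file; the name and content are only for illustration.
path = os.path.join(tempfile.mkdtemp(), "with_bom.csv")
body = u"\u65e5\u6728\u66dc Deliverables"

# Write raw bytes: a UTF-8 BOM followed by the UTF-8 encoded text.
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + body.encode("utf-8"))

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
with codecs.open(path, "r", "utf-8") as f:
    plain = f.read()

# "utf-8-sig" removes a leading BOM while decoding.
with codecs.open(path, "r", "utf-8-sig") as f:
    stripped = f.read()

print(repr(plain[0]))    # the BOM character survived
print(stripped == body)  # the BOM was stripped
```

If the first character read with "utf-8" is U+FEFF, the input file carries a BOM, and reading it with "utf-8-sig" keeps that character out of the strings you build.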

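Just noticed this, but... a file opened with codecs.open() expects a Unicode string, not an encoded one. If actionT() returns a unicode object, str() will try to encode it with the default ASCII codec, which would raise exactly this error. Try x = unicode(actionT(splitList[0], splitList[1])).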
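As a rough sketch of why str() is the likely culprit: on Python 2, str() applied to a unicode object encodes it with the default codec, which is ASCII unless it has been changed. The snippet below spells out that encode step explicitly so it also runs on Python 3. The sample text is an assumption modelled on the string in the question.

```python
# Sample text assumed for illustration: a BOM followed by non-ASCII characters.
text = u"\ufeff[\u65e5\u6728\u66dc Deliverables]"

# This is the encode that Python 2 performs implicitly inside str(text).
try:
    text.encode("ascii")
    ascii_failed = False
    message = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    message = str(exc)

print(ascii_failed)
print(message)

# Encoding explicitly with UTF-8 works, because UTF-8 can represent
# every code point in the string.
encoded = text.encode("utf-8")
print(repr(encoded))
```

Keeping the value as a unicode object and letting the codecs.open() file do the encoding avoids the implicit ASCII step altogether.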

Your error can also occur when attempting to decode a string that is already unicode (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that is happening here unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.

answered Oct 07 '22 12:10

JAB

In Python 2.x there are two types of string: byte strings and unicode strings. The first holds bytes and the second holds unicode code points. It is easy to tell them apart in source code: a unicode string literal starts with u:

# byte string
>>> 'abc'
'abc'

# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'

The 'abc' characters look the same in both because they are in the ASCII range. \u0430 is a unicode code point outside the ASCII range. Code points are how Python represents unicode text internally; they can't be written to a file as they are and must be encoded to bytes first. Here is what an encoded unicode string looks like (once encoded, it is a byte string):

>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

This encoded string can now be written to a file:

>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))

Now, it is important to remember which encoding we used when writing the file, because to read the data back we need to decode the content with the same encoding. Here is what the data looks like without decoding:

>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode them, we need to provide the encoding name:

>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'

After decoding, we get back our unicode string with its unicode code points.

>>> print content.decode('utf8')
abc абв
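The manual encode and decode steps above can also be left to the file object. The following is a sketch using io.open(), which exists on Python 2.6+ and Python 3 and accepts an encoding argument. It is an alternative to the approach shown above, not part of it, and the temporary path is made up for the example.

```python
import io
import os
import tempfile

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
s = u"abc \u0430\u0431\u0432"

# The file object encodes unicode text to UTF-8 bytes on write.
with io.open(path, "w", encoding="utf8") as f:
    f.write(s)

# Reading in binary mode shows the encoded bytes stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

# Reading in text mode with the same encoding decodes back to unicode.
with io.open(path, "r", encoding="utf8") as f:
    content = f.read()

print(repr(raw))
print(content == s)
```

With this style, the program only ever handles unicode strings, and the encoding name appears in one place per file.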

answered Oct 07 '22 10:10

stalk
