 

UTF-8 problem in python when reading chars

Tags:

python

utf-8

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
    print line
    for i in line:
        print i,
f.close()

output:

Stäckövérfløw

S t � � c k � � v � � r f l � � w 
asked Jun 12 '09 by jacob

People also ask

Can UTF-8 support all characters?

UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communication.

How do I decode a UTF-8 string in Python?

To decode a string encoded in UTF-8 format, we can use the decode() method available on strings. This method accepts two arguments, encoding and errors: encoding names the encoding of the string to be decoded, and errors decides how to handle errors that arise during decoding.
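For instance, in the Python 2 setting of this question, a minimal sketch (assuming the byte string below, which is the UTF-8 encoding of "Stäck"):

# Python 2: str.decode() turns a UTF-8 byte string into a unicode object
raw = 'St\xc3\xa4ck'                       # bytes as read from a UTF-8 file
text = raw.decode('utf-8')                 # u'St\xe4ck', a unicode object of 5 characters
lenient = raw.decode('utf-8', 'replace')   # malformed bytes become U+FFFD instead of raising an error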

Can UTF-8 handle Chinese characters?

UTF-8 is a character encoding system. It is backward-compatible with plain ASCII text while still allowing for international characters, such as Chinese characters. As of the mid-2020s, UTF-8 is one of the most popular encoding systems.
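As a quick illustration in the Python 2 style used elsewhere on this page (a sketch; the two characters below are "你好"):

# Each of these Chinese characters takes 3 bytes in UTF-8
chinese = u'\u4f60\u597d'            # u"你好"
encoded = chinese.encode('utf-8')    # '\xe4\xbd\xa0\xe5\xa5\xbd'
print len(chinese), len(encoded)     # 2 characters, 6 bytes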


4 Answers

for i in line:
    print i,

When you read the file, what you get back is a string of bytes, and the for loop iterates over it one byte at a time. That breaks UTF-8 encoded text, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where characters are the basic pieces, you should use

import codecs
f = codecs.open('in.txt', 'r', 'utf8')  # decodes the file's bytes to unicode as you read

If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:

import sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
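Putting both pieces together, a sketch of how the question's script might look with these changes applied (assuming the same in.txt as in the question):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import sys

print """Content-Type: text/plain; charset="UTF-8"\n"""
# Encode unicode objects as UTF-8 on the way out
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

f = codecs.open('in.txt', 'r', 'utf8')   # yields unicode objects, not byte strings
for line in f:
    print line
    for ch in line:                      # iterates over characters, not bytes
        print ch,
f.close()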
answered Oct 02 '22 by Miles

Use codecs.open instead; it works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs

print """Content-Type: text/plain; charset="UTF-8"\n"""
f = codecs.open('in.txt', 'r', 'utf8')
for line in f:
    print line
    for i in line:
        print i,
f.close()
answered Oct 04 '22 by mhawke


Check this out:

# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()

It returns this:

Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w

The thing is that the file is just being read as a string of bytes. Iterating over it splits each multibyte character into meaningless single bytes.
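For comparison, decoding the byte string first makes iteration yield whole characters (a sketch assuming the same file contents and a UTF-8 terminal):

# -*- coding: utf-8 -*-
raw = 'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'   # what open().read() hands back
decoded = raw.decode('utf-8')                        # u'St\xe4ck\xf6v\xe9rfl\xf8w'
print len(raw)        # 17 -- bytes
print len(decoded)    # 13 -- characters
for ch in decoded:
    print ch,         # whole characters instead of broken byte halves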

answered Oct 03 '22 by mikl


print i,

adds a space after each byte, which splits correct UTF-8 byte sequences into broken ones. So this will not work unless you write each byte to the output directly:

import sys
sys.stdout.write(i)
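A sketch of the difference, assuming the same byte-string input as in the question and a UTF-8 terminal:

import sys

line = 'St\xc3\xa4ck'          # UTF-8 bytes for "Stäck"
for i in line:
    print i,                   # a space lands between the two bytes of "ä" -- mojibake
print
for i in line:
    sys.stdout.write(i)        # bytes go out back-to-back; the terminal reassembles "ä"
print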
answered Oct 04 '22 by Artyom