I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?
in.txt:
Stäckövérfløw
code.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = open('in.txt','r')
for line in f:
print line
for i in line:
print i,
f.close()
output:
Stäckövérfløw
S t � � c k � � v � � r f l � � w
UTF-8 supports any unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phonecian, Cherokee etc), as well as many non-spoken languages (Music notation, mathematical symbols, APL). The stated objective of the Unicode consortium is to encompass all communications.
To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.
UTF-8 is a character encoding system. It lets you represent characters as ASCII text, while still allowing for international characters, such as Chinese characters. As of the mid 2020s, UTF-8 is one of the most popular encoding systems.
for i in line:
print i,
When you read the file, the string you read in is a string of bytes. The for loop iterates over a single byte at a time. This causes problems with a UTF-8 encoded string, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where the characters are the basic pieces, you should use
import codecs
f = codecs.open('in', 'r', 'utf8')
If sys.stdout
doesn't already have the appropriate encoding set, you may have to wrap it:
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Use codecs.open instead, it works for me.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print """Content-Type: text/plain; charset="UTF-8"\n"""
f = codecs.open('in','r','utf8')
for line in f:
print line
for i in line:
print i,
f.close()
Check this out:
# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
print line
pprint.pprint(line)
for i in line:
print i,
f.close()
It returns this:
Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w
The thing is that the file is just being read as a string of bytes. Iterating over them splits the multibyte characters into nonsensical byte values.
print c,
Adds a "blank charrecter" and breaks correct utf-8 sequences into incorrect one. So this would not work unless you write a signle byte to output
sys.stdout.write(i)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With