I've got an issue with iterating through Unicode strings, character by character, in Python.
print "w: ",word
for c in word:
    print "word: ",c
This is my output:
w: 文本
word: ?
word: ?
word: ?
word: ?
word: ?
word: ?
My desired output is:
文
本
When I use len(word) I get 6. Apparently each character takes up 3 bytes.
So, my Unicode string is successfully stored in the variable, but I cannot get the characters out. I have tried using encode('utf-8'), decode('utf-8') and codecs but still cannot obtain any good results. This seems like a simple problem but is frustratingly hard for me.
Hope someone can point me in the right direction.
Thanks!
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
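As a small illustration of chr(), ord() and the escape sequences, here is a sketch that assumes Python 3, where every str is a sequence of Unicode code points. The variable names are made up for the example.

```python
# Assumes Python 3; names such as code_point are illustrative only.
ch = "文"

# ord() maps a one-character string to its integer code point.
code_point = ord(ch)

# chr() maps an integer code point back to a one-character string.
round_trip = chr(code_point)

# The same characters written with escapes in string literals.
escaped_short = "\u6587\u672c"   # \u takes exactly four hex digits
escaped_long = "\U00006587"      # \U takes exactly eight hex digits
escaped_hex = "\x41"             # \x takes exactly two hex digits

print(hex(code_point), round_trip, escaped_short, escaped_long, escaped_hex)
```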
Looping through a string
One way to iterate over a string is to use for i in range(len(s)): so that the variable i receives each index and the character can be accessed as s[i].
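For comparison, here is a rough sketch of index-based looping next to direct iteration. It assumes Python 3 and a sample string named text, which is not from the original posts. Both approaches yield whole characters, not bytes.

```python
# Assumes Python 3; text is a sample value chosen for illustration.
text = "文本"

# Index-based loop: i is the position, text[i] is the character.
by_index = []
for i in range(len(text)):
    by_index.append(text[i])

# Direct iteration is usually simpler when the index is not needed.
direct = []
for ch in text:
    direct.append(ch)

# enumerate() gives both the index and the character.
pairs = list(enumerate(text))

print(by_index, direct, pairs)
```

Direct iteration or enumerate() tends to read better than range(len(...)), but all three visit the same characters.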
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
In Python 2 source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. In Python 3 every string literal is already Unicode and the prefix is optional. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point.
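The length of 6 reported in the question is what you would expect if word holds UTF-8 bytes rather than decoded text. A minimal sketch of the difference, assuming Python 3 (where bytes and str are separate types) and illustrative variable names:

```python
# Assumes Python 3; variable names are illustrative.
text = "文本"

# Encoding turns the two characters into six UTF-8 bytes.
raw = text.encode("utf-8")
print(len(text))  # number of characters
print(len(raw))   # number of bytes

# Iterating over bytes yields integers, one per byte.
byte_values = list(raw)

# Decoding the bytes gives back text that iterates per character.
decoded = raw.decode("utf-8")
chars = list(decoded)
print(chars)
```

In Python 2 the same idea applies, except that a plain str plays the role of bytes and unicode plays the role of text.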
# -*- coding: utf-8 -*-
# Python 2: decode the UTF-8 byte string into a unicode object before iterating
word = "文本"
print(word)
for each in unicode(word, "utf-8"):
    print(each)
Output:
文本
文
本
The code I used which works is this:
import codecs
fileContent = codecs.open('fileName.txt', 'r', encoding='utf-8')
#...split by whitespace to get words..
for c in word:
    print(c.encode('utf-8'))
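For anyone on Python 3, a rough equivalent of the file-based approach might look like the sketch below. It uses the built-in open() with an explicit encoding. The temporary file and its contents are made up so the example runs on its own, and they stand in for whatever file you actually read.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("文本 example")

# Reading with an explicit encoding returns decoded text (str), not bytes.
with open(path, "r", encoding="utf-8") as handle:
    words = handle.read().split()

# Each word now iterates one character at a time.
chars = [c for word in words for c in word]
for c in chars:
    print(c)
```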