
Iterating through a unicode string in Python

I've got an issue with iterating through Unicode strings, character by character, in Python.

print "w: ",word
for c in word:
    print "word: ",c

This is my output:

w:  文本
word:  ? 
word:  ?
word:  ?
word:  ?
word:  ?
word:  ?

My desired output is:

文
本

When I use len(word) I get 6. Apparently each character is stored as 3 bytes.

So, my Unicode string is successfully stored in the variable, but I cannot get the characters out. I have tried using encode('utf-8'), decode('utf-8') and codecs, but still cannot get good results. This seems like a simple problem, but it is frustratingly hard for me.

I hope someone can point me in the right direction.

Thanks!

charpi asked Jun 22 '15 03:06




2 Answers

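# -*- coding: utf-8 -*-
# Python 2: a plain string literal holds the UTF-8 bytes, not characters
word = "文本"
print(word)
# Decode the bytes into a unicode object so that iteration yields characters
for each in unicode(word,"utf-8"):
    print(each)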

Output:

文本
文
本
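
To see why decoding helps, it may be useful to compare the length of the raw bytes with the length of the decoded text. This is only a minimal sketch, written so that it should also run on Python 3 (where `unicode` no longer exists and `str` is already Unicode); the names `raw`, `text` and `chars` are illustrative and not part of the original code:

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of the two characters, similar to what a Python 2 str holds.
raw = u"文本".encode("utf-8")
print(len(raw))    # 6: each of these two characters takes 3 bytes in UTF-8

# Decoding turns the bytes into a sequence of code points.
text = raw.decode("utf-8")
print(len(text))   # 2

# Iterating over the decoded text yields one character per step.
chars = [c for c in text]
for c in chars:
    print(c)
```

On Python 2, whether the final `print` calls display correctly still depends on the terminal's encoding.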
Pruthvi Raj answered Oct 19 '22 23:10



The code that ended up working for me is this:

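import codecs

# codecs.open decodes the file as UTF-8, so reading from it yields unicode strings
fileContent = codecs.open('fileName.txt','r',encoding='utf-8')
#...split by whitespace to get words..
for c in word:
    # Encode each character back to UTF-8 bytes before printing
    print(c.encode('utf-8'))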
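
For reference, the same idea can be sketched end to end with `io.open`, which exists in both Python 2.6+ and Python 3. This is a hedged illustration, not the exact code from my project: the temporary file, its contents, and the names `path`, `words` and `chars` are made up so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"文本 示例\n")

# io.open decodes while reading, so each word is a sequence of characters.
with io.open(path, "r", encoding="utf-8") as f:
    words = f.read().split()

# Flatten the words into individual characters and print them one per line.
chars = [c for word in words for c in word]
for c in chars:
    print(c)
```

Because the text is decoded once at the file boundary, the loop never sees raw bytes, so no per-character `encode` or `decode` is needed on Python 3.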
charpi answered Oct 20 '22 01:10
