
Iterate through unicode strings and compare with unicode keys in a Python dictionary

Tags:

python

unicode


I have two Python dictionaries containing information about Japanese words and characters:

  1. vocabDic: contains vocabulary. Key: word; value: dictionary with information about it.
  2. kanjiDic: contains kanji (single Japanese characters). Key: kanji; value: dictionary with information about it.

Now I would like to iterate through each character of each word in vocabDic and look up that character in the kanji dictionary. My goal is to create a CSV file that I can then import into a database as a join table for vocabulary and kanji.
My Python version is 2.6.
My code is as follows:

    import csv

    kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    kanjiVocabJoinCount = 1
    
    #loop through dictionary
    for key, val in vocabDic.iteritems():
        if val['lang'] == 'jpn': # only check Japanese words
            vocab = val['text']
            print vocab
            # loop through vocab string
            for v in vocab:
                 test = kanjiDic.get(v)
                 print v
                 print test
                 if test is not None:
                    print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                    kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount),str(test['id']),str(val['id'])])
                    kanjiVocabJoinCount = kanjiVocabJoinCount+1
    

If I print the variables to the command line, I get:
vocab: works, prints in Japanese
v (one character of the vocab in the for loop): �
test (the character looked up in kanjiDic): None

It seems to me that the for loop messes up the encoding.
I tried various functions (decode, encode, ...) but have had no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.

daniela asked Aug 07 '11 17:08


1 Answer

From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.

For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:

In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8')   # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'

If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.

However, if you loop over the unicode object, you get one unicode character at a time:

In [45]: for v in u'債務の天井':
   ....:     print(v)    
債
務
の
天
井

Note that the first unicode character, encoded in utf-8, is 3 bytes:

In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'

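That's why looping over the bytes and printing one byte at a time (e.g. print '\xe5') fails to print a recognizable character: a single byte taken out of a multi-byte sequence is not a complete utf-8 character. It is also why kanjiDic.get(v) returns None, since a lone byte never matches a key that is a whole character.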
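To make the byte/character difference concrete, here is a minimal sketch. It assumes the same utf-8 example string as above; the variable names are only illustrative. It is written so that it runs on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'債務の天井'
raw = text.encode('utf-8')      # what an encoded str (bytes) holds

char_count = len(text)          # counts unicode characters
byte_count = len(raw)           # counts utf-8 bytes

# One byte cut out of a multi-byte sequence cannot be decoded on its own.
try:
    raw[0:1].decode('utf-8')
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

# The first three bytes together make up the first character.
first_char = raw[0:3].decode('utf-8')

print(char_count)
print(byte_count)
print(lone_byte_is_valid)
```

The two lengths differ because each of these five characters takes three bytes in utf-8.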

So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:

vocab=val['text'].decode('utf-8')
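Putting this together with the loop from the question, a rough sketch might look like the following. It assumes everything is utf-8, and the helper names (to_unicode, kanji_vocab_rows) and the sample data are hypothetical, not taken from the question. It also decodes the kanjiDic keys, since both sides of the lookup need to be unicode for the comparison to succeed.

```python
# -*- coding: utf-8 -*-

def to_unicode(value, encoding='utf-8'):
    """Decode byte strings; pass unicode objects through unchanged."""
    if isinstance(value, bytes):   # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


def kanji_vocab_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return (join_id, kanji_id, vocab_id) tuples, one per kanji found."""
    # Decode the kanji keys as well, so the lookup compares unicode to unicode.
    kanji_by_char = dict((to_unicode(k, encoding), v)
                         for k, v in kanji_dic.items())
    rows = []
    # Sort by vocabulary id so the output order is deterministic.
    for val in sorted(vocab_dic.values(), key=lambda entry: entry['id']):
        if val['lang'] != 'jpn':   # only check Japanese words
            continue
        for char in to_unicode(val['text'], encoding):
            kanji = kanji_by_char.get(char)
            if kanji is not None:
                rows.append((len(rows) + 1, kanji['id'], val['id']))
    return rows


# Hypothetical sample data, shaped like the dictionaries in the question.
vocab_dic = {
    'w1': {'id': 10, 'lang': 'jpn', 'text': u'債務の天井'.encode('utf-8')},
    'w2': {'id': 11, 'lang': 'eng', 'text': b'debt ceiling'},
}
kanji_dic = {u'債': {'id': 1}, u'天': {'id': 2}}

rows = kanji_vocab_rows(vocab_dic, kanji_dic)
print(rows)
```

The resulting rows contain only integers, so they can be handed to the csv writer's writerows() without any further encoding concerns.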

If you are not sure what encoding val['text'] is in, post the output of

print(repr(vocab))

and maybe we can guess the encoding.
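In the meantime, one crude way to narrow it down yourself is to try a few likely encodings and see which one decodes without an error. This is only a sketch: the candidate list and the guess_encoding name are assumptions, a successful decode does not prove the guess is right, and the order of the candidates matters because some byte sequences are valid in more than one encoding.

```python
# -*- coding: utf-8 -*-

# Hypothetical candidate list; adjust it to the encodings you consider likely.
CANDIDATES = ('utf-8', 'shift_jis', 'euc_jp')


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = u'債務の天井'
print(guess_encoding(sample.encode('utf-8')))
print(guess_encoding(sample.encode('shift_jis')))
```

Checking the decoded text by eye, for example against a word you know is in the data, is still the most reliable confirmation.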

unutbu answered Oct 23 '22 05:10