I have two Python dictionaries containing information about Japanese words and characters:
kanjiDic: contains kanji (single Japanese characters); key: kanji, value: dictionary with information about it
Now I would like to iterate through each character of each word in vocabDic and look up this character in the kanji dictionary. My goal is to create a CSV file which I can then import into a database as a join table for vocabulary and kanji.
My Python version is 2.6
My code is as follows:
import csv

kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1
# loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] == 'jpn':  # only check japanese words ('==' compares values, 'is' compares identity)
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
            test = kanjiDic.get(v)
            print v
            print test
            if test is not None:
                print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                # a csv writer object is not callable; rows are written with writerow()
                kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount),str(test['id']),str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount+1
If I print the variables to the command line, I get:
vocab : works, prints in Japanese
v ( one character of the vocab in the for loop ) : �
test ( character looked up in the kanjiDic ) : None
To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode, ... ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.
From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.
For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:
In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8') # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'
If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.
However, if you loop over the unicode object, you get one unicode character at a time:
In [45]: for v in u'債務の天井':
   ....:     print(v)
債
務
の
天
井
Note that the first unicode character, encoded in utf-8, is 3 bytes:
In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'
That's why looping over the bytes and printing one byte at a time (e.g. print '\xe5') fails to print a recognizable character: a single byte of a multi-byte sequence is not a complete character.
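To make the difference concrete, here is a minimal sketch that counts characters versus bytes for the same word. It is written to run on both Python 2.6+ and Python 3, which is why it goes through bytearray (iterating a bytearray yields integer byte values on both); the variable names are just for illustration.

```python
# -*- coding: utf-8 -*-
text = u'債務の天井'             # unicode object: one item per character
encoded = text.encode('utf-8')   # byte string: several bytes per character

# bytearray yields integer byte values on both Python 2 and Python 3
byte_values = [hex(b) for b in bytearray(encoded)]
chars = list(text)

print(len(chars))         # 5 characters
print(len(byte_values))   # 15 bytes
print(byte_values[:3])    # ['0xe5', '0x82', '0xb5'], the bytes of the first character
```

A lookup such as kanjiDic.get(v) with v being one of those single bytes can never match a key that is a whole kanji.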
So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:
vocab=val['text'].decode('utf-8')
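Applied to the loop from the question, that could look roughly like the sketch below. It assumes the text is utf-8 and that the keys of kanjiDic are unicode objects (if they are encoded byte strings too, they would need the same decoding). The function name kanji_vocab_rows and the sample data are hypothetical, and the code is kept compatible with both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
def kanji_vocab_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return (kanji_id, vocab_id) pairs for each kanji found in Japanese vocab."""
    rows = []
    for key, val in vocab_dic.items():
        if val['lang'] != 'jpn':      # only check Japanese words
            continue
        text = val['text']
        if isinstance(text, bytes):   # encoded byte string: decode it first
            text = text.decode(encoding)
        for ch in text:               # now one unicode character per iteration
            entry = kanji_dic.get(ch)
            if entry is not None:
                rows.append((entry['id'], val['id']))
    return rows

# Hypothetical sample data, only to exercise the function
kanji_dic = {u'天': {'id': 7}, u'井': {'id': 9}}
vocab_dic = {
    'w1': {'id': 1, 'lang': 'jpn', 'text': u'天井'.encode('utf-8')},
    'w2': {'id': 2, 'lang': 'eng', 'text': b'ceiling'},
}
rows = kanji_vocab_rows(vocab_dic, kanji_dic)
print(rows)   # [(7, 1), (9, 1)]
```

The resulting pairs can then be numbered and passed to the csv writer's writerow method.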
If you are not sure what encoding val['text'] is in, post the output of
print(repr(vocab))
and maybe we can guess the encoding.
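In the meantime, one rough way to narrow it down yourself is to try a few encodings that are common for Japanese text and see which ones decode without raising an error. This is only a sketch: the helper name candidate_encodings and the list of encodings are my assumptions, and a decode that succeeds is not proof that the encoding is right, so check the decoded text by eye.

```python
# -*- coding: utf-8 -*-
def candidate_encodings(raw, encodings=('utf-8', 'shift_jis', 'euc_jp')):
    """Return the names in `encodings` that decode `raw` without an error."""
    ok = []
    for name in encodings:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue                  # these bytes are not valid in this encoding
        ok.append(name)
    return ok

# Hypothetical sample: a byte string that was encoded as Shift_JIS
sample = u'の'.encode('shift_jis')
print(repr(sample))
print(candidate_encodings(sample))
```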