
python 2.7 encoding decoding

I have a problem involving encoding/decoding. I read text from a file and compare it with text from a database (Postgres). The comparison is done between two lists.

From the file I get "jo\x9a" for "još", and from the database I get "jo\xc5\xa1" for the same value.

common = [a for a in codes_from_file if a in kode_prfoksov]

# Items in one but not the other
only1 = [a for a in codes_from_file if a not in kode_prfoksov]

# Items only in the other
only2 = [a for a in kode_prfoksov if a not in codes_from_file]

How can I solve this? Which encoding should be used when comparing these two strings?

thank you

Yebach asked Mar 21 '12 09:03


2 Answers

The first one seems to be windows-1250, and the second is utf-8.

>>> print 'jo\x9a'.decode('windows-1250')
još
>>> print 'jo\xc5\xa1'.decode('utf-8')
još
>>> 'jo\x9a'.decode('windows-1250') == 'jo\xc5\xa1'.decode('utf-8')
True
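
Going the other way may make the mismatch easier to see. The following is only a small sketch (the variable names are illustrative, not from the question): the same unicode string produces different bytes depending on the encoding, which is why the raw byte strings never compare equal.

```python
# -*- coding: utf-8 -*-
# The text "još" as a unicode string; U+0161 is the letter "š".
text = u"jo\u0161"

# Encoding the same text with two different codecs yields different bytes.
cp1250_bytes = text.encode("windows-1250")  # what the file appears to contain
utf8_bytes = text.encode("utf-8")           # what the database appears to return

# The raw bytes differ, but decoding each with its own codec gives equal text.
same_bytes = cp1250_bytes == utf8_bytes
same_text = cp1250_bytes.decode("windows-1250") == utf8_bytes.decode("utf-8")
```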
stranac answered Nov 16 '22 07:11


Your file strings seem to be Windows-1250 encoded. Your database seems to contain UTF-8 strings.

So you can either convert all strings to unicode first:

codes_from_file = [a.decode("windows-1250") for a in codes_from_file]
kode_prfoksov = [a.decode("utf-8") for a in kode_prfoksov]

Or, if you do not want unicode strings, just convert the file strings to UTF-8:

codes_from_file = [a.decode("windows-1250").encode("utf-8") for a in codes_from_file]
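
Putting the first option together with the comparison from the question, a minimal sketch could look like this. It assumes both lists hold byte strings (a database driver may already return unicode, in which case the second decode is unnecessary). The to_unicode helper and the sample values are hypothetical, and sets are swapped in for the list comprehensions, so duplicates and ordering are not preserved.

```python
def to_unicode(items, encoding):
    """Decode every byte string in items using the given encoding."""
    return [item.decode(encoding) for item in items]

# Hypothetical sample data standing in for the real file and database values.
codes_from_file = [b"jo\x9a", b"abc"]      # Windows-1250 bytes from the file
kode_prfoksov = [b"jo\xc5\xa1", b"xyz"]    # UTF-8 bytes from the database

# Decode each source with its own encoding before comparing anything.
file_codes = set(to_unicode(codes_from_file, "windows-1250"))
db_codes = set(to_unicode(kode_prfoksov, "utf-8"))

common = file_codes & db_codes   # items present in both
only1 = file_codes - db_codes    # items only in the file
only2 = db_codes - file_codes    # items only in the database
```

With long lists the set operations are also faster than the repeated "in" checks on lists, since each membership test on a list scans the whole list.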
jofel answered Nov 16 '22 05:11