Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Correctly reading text from Windows-1252(cp1252) file in python

so okay, as the title suggests the problem I have is with correctly reading input from a windows-1252 encoded file in python and inserting said input into SQLAlchemy-MySql table.

The current system setup:
Windows 7 VM with "Roger Access Control System" which outputs the file;
Ubuntu 12.04 LTS VM with a shared-folder to the Windows system so I can access the file, using "Python 2.7.3".

Now to the actual problem, for the input file I have a "VM shared-folder" that contains a file that is genereate on a Windows 7 system through Roger Access Control System(roger.pl for more details), this file is called "PREvents.csv" which suggests to it's contents, a ";" seperated list of data.

An example format of the data:

2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja; 2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja; 

The 4th field contains the card owners name and 5th contains the owners lastname, the 6th contains the owners assigned group.

The issue comes from the fact that any one of the 3 above mentioned fields can contain characters specific to Latvian language, in the example file the word "Jānis" contains the letter "ā" which in unicode is 257.

As I'm used to, I open the file as such:

try:     f = codecs.open(file, 'rb', 'cp1252') except IOError:     f = codecs.open(file, 'wb', 'cp1252') 

So far, everything works - it opens the file and so I move on to iterate over each line of the file(this is a continuos running script so pardon the loop):

while True:     line = f.readline()      if not line:         # Pause loop for 1 second         time.sleep(1)     else:         # Split the line into list         date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';') 

And this is where the issues start, if I print repr(firstname) it prints u'J\xe2nis' which is, as far as I undestand, not correct - `\xe2\ does not represent the Latvian character "ā".
Further down the loop depending on event type I assign the variables to SQLAlchemy object and insert/update:

if typed == '0':  # Entry type     event = Events(         period,         fullname,         userid,         groupname,         timestamp,         0,         0     )     session.add(event) else:  # Exit type     event = session.query(Events).filter(         Events.period == period,         Events.exit == 0,         Events.userid == userid     ).first()     if event is not None:         event.exit = timestamp         event.spent = timestamp - event.entry  # Commit changes to database session.commit() 

In my search for answers I've found how to define the default encoding to use:

import sys reload(sys) sys.setdefaultencoding('utf-8') 

Which hasn't helped me in any way.

Basically, this is all leads to the me not being able to insert the correct owners First/last name aswell as owners assigned groupname if they contain any of Latvian-specific characters, for example:

Instead of the character "ā" it inserts "â" 

I'd also like to add that I cannot change the "PREvents.csv" file encoding and the "RACS" system does not support inserting into UTF-8 or Unicode files - if you try either way, the system inserts random symbols for the Latvian-specific characters.

Please let me now if any other information is needed, I'll gladly provide it :)

Any help would be highly appreciated.

like image 754
Krisjanis Zvaigzne Avatar asked Mar 19 '13 14:03

Krisjanis Zvaigzne


People also ask

What is the difference between UTF-8 and Windows-1252 encoding?

Windows-1252 is a subset of UTF-8 in terms of 'what characters are available', but not in terms of their byte-by-byte representation. Windows-1252 has characters between bytes 127 and 255 that UTF-8 has a different encoding for. Any visible character in the ASCII range (127 and below) are encoded 1:1 in UTF-8.

What is encoding Windows-1252?

Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.

What is Python's default encoding?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)


Video Answer


2 Answers

CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:

>>> print(repr(b'J\xe2nis'.decode('cp1252'))) u'J\xe2nis' >>> print(b'J\xe2nis'.decode('cp1252')) Jânis 
like image 85
phihag Avatar answered Sep 17 '22 19:09

phihag


I think u'J\xe2nis' is correct, see:

>>> print u'J\xe2nis'.encode('utf-8') Jânis 

Are you getting actual errors from SQLAlchemy or in your application's output?

like image 40
djc Avatar answered Sep 20 '22 19:09

djc