Correctly reading text from Windows-1252(cp1252) file in python

Tags:

so okay, as the title suggests the problem I have is with correctly reading input from a windows-1252 encoded file in python and inserting said input into SQLAlchemy-MySql table.

The current system setup:
Windows 7 VM with "Roger Access Control System" which outputs the file;
Ubuntu 12.04 LTS VM with a shared-folder to the Windows system so I can access the file, using "Python 2.7.3".

Now to the actual problem, for the input file I have a "VM shared-folder" that contains a file that is genereate on a Windows 7 system through Roger Access Control System(roger.pl for more details), this file is called "PREvents.csv" which suggests to it's contents, a ";" seperated list of data.

An example format of the data:

2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja; 2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja;

The 4th field contains the card owners name and 5th contains the owners lastname, the 6th contains the owners assigned group.

The issue comes from the fact that any one of the 3 above mentioned fields can contain characters specific to Latvian language, in the example file the word "Jānis" contains the letter "ā" which in unicode is 257.

As I'm used to, I open the file as such:

try:     f = codecs.open(file, 'rb', 'cp1252') except IOError:     f = codecs.open(file, 'wb', 'cp1252')

So far, everything works - it opens the file and so I move on to iterate over each line of the file(this is a continuos running script so pardon the loop):

while True:     line = f.readline()      if not line:         # Pause loop for 1 second         time.sleep(1)     else:         # Split the line into list         date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';')

And this is where the issues start, if I print repr(firstname) it prints u'J\xe2nis' which is, as far as I undestand, not correct - `\xe2\ does not represent the Latvian character "ā".
Further down the loop depending on event type I assign the variables to SQLAlchemy object and insert/update:

if typed == '0':  # Entry type     event = Events(         period,         fullname,         userid,         groupname,         timestamp,         0,         0     )     session.add(event) else:  # Exit type     event = session.query(Events).filter(         Events.period == period,         Events.exit == 0,         Events.userid == userid     ).first()     if event is not None:         event.exit = timestamp         event.spent = timestamp - event.entry  # Commit changes to database session.commit()

In my search for answers I've found how to define the default encoding to use:

import sys reload(sys) sys.setdefaultencoding('utf-8')

Which hasn't helped me in any way.

Basically, this is all leads to the me not being able to insert the correct owners First/last name aswell as owners assigned groupname if they contain any of Latvian-specific characters, for example:

Instead of the character "ā" it inserts "â"

I'd also like to add that I cannot change the "PREvents.csv" file encoding and the "RACS" system does not support inserting into UTF-8 or Unicode files - if you try either way, the system inserts random symbols for the Latvian-specific characters.

Please let me now if any other information is needed, I'll gladly provide it :)

Any help would be highly appreciated.

754

asked Mar 19 '13 14:03

Krisjanis Zvaigzne

Video Answer

2 Answers

CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:

>>> print(repr(b'J\xe2nis'.decode('cp1252'))) u'J\xe2nis' >>> print(b'J\xe2nis'.decode('cp1252')) Jânis

answered Sep 17 '22 19:09

phihag

I think u'J\xe2nis' is correct, see:

>>> print u'J\xe2nis'.encode('utf-8') Jânis

Are you getting actual errors from SQLAlchemy or in your application's output?

answered Sep 20 '22 19:09

djc

Related questions
                            
                                Windows: Z3Exception("init(Z3_LIBRARY_PATH) must be invoked before using Z3-python")
                            
                                Idiomatic way to collect & report multiple exceptions in Python
                            
                                How can I configure Pylint to check all things PEP8 checks?
                            
                                Importing local module (python script) in Airflow DAG
                            
                                Is there any way to get vim to auto wrap python strings at 79 chars?
                            
                                Pygame water ripple effect
                            
                                Is there any working memory profiler for Python3 [closed]
                            
                                Python program hangs forever when called from subprocess
                            
                                Python dependency hell: A compromise between virtualenv and global dependencies?
                            
                                Python alignment of assignments (style) [closed]
                            
                                Can I control the architecture (32bit vs 64bit) when building a pyinstaller executable?
                            
                                What's the difference between loop.create_task, asyncio.async/ensure_future and Task?
                            
                                What is the difference between mocking and monkey patching?
                            
                                Is python package virtualenv necessary when I use python 3.3?
                            
                                ANOVA in python using pandas dataframe with statsmodels or scipy?
                            
                                How to work with multiple inputs for LSTM in Keras?
                            
                                Implementation of Repository Pattern in Python?
                            
                                Run Rsync from Python [duplicate]
                            
                                whats the difference between python 3.3 and 3.3m [duplicate]
                            
                                Any experiences with Protocol Buffers?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Correctly reading text from Windows-1252(cp1252) file in python

Tags:

python

python-unicode

utf-8

windows-1252

cp1252

Krisjanis Zvaigzne

People also ask

Video Answer

2 Answers

phihag

djc

Recent Activity

Donate For Us