
Getting the correct encoding for strings and csv-files in Python

I'm using mechanize in Python to grab some data from a website and send it new data.

The thing is that the site is in French, so I get question marks in a diamond shape (�) instead of various characters such as éÉÀàùÙîû and others.

I tried looking around on Google and StackOverflow and found various answers that didn't fix my problem. I've seen answers recommending trying one of the following lines:

myString = 'éÀî'

myString.encode('latin-1')
myString.encode('iso-8859-1')

unicode(myString, 'iso-8859-1')

but none of those seem to work.

I need this in two cases: when I read a csv file containing accents, and when I use hardcoded strings containing accents. For instance, here's what a line in the csv file looks like (the separator is actually ';'):

Adam Guérin;[email protected];555-5555;2011-02-05

The 'é' looks fine, but when I try to fill a textField on the website with mechanize and submit it, the 'é' now looks like '�' on the actual website.

Edit:

This is my code for reading the data in the csv file:

subscriberReader = csv.reader(open(path, 'rb'), delimiter=';')
subscribers = []

for row in subscriberReader:
    subscribers.append(Subscriber(row[0], row[1], row[2]))

Then I send it to the website using mechanize:

self.br.select_form('aspnetForm')

self.br.form['fldEmail'] = subscriber.email
self.br.form['fldName'] = subscriber.name
self.br.form['fldPhoneNum'] = subscriber.phoneNum

self.br.submit()

I tried various ways to encode the characters, but I guess I'm not doing it correctly. I'll be glad to try anything that gets suggested in the answers / comments.

As for the website, it doesn't specify which encoding it is using in the header.

Adam Smith, asked Aug 02 '12 20:08


1 Answer

First, you mentioned that you want to place literals into your code. To do so, you need to tell Python what encoding your script file uses. You do this with a comment declaration on the first or second line of the file (I'll assume that you're using latin-1; the declaration has to match the encoding your editor actually saves the file in).

# -*- coding: latin-1 -*-
myString = u'éÀî'
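
To make the difference between the unicode object and its encoded bytes concrete, here is a minimal sketch. It writes the same three characters with escape sequences so that it doesn't depend on how the script file is saved; the variable names are my own, and it is written to run on both Python 2.7 and Python 3.

```python
# Same text as u'éÀî', spelled with escapes so the file encoding is irrelevant.
my_string = u'\xe9\xc0\xee'

# Encoding turns the text into bytes; the result depends on the encoding chosen.
latin1_bytes = my_string.encode('latin-1')
utf8_bytes = my_string.encode('utf-8')

print(len(my_string))     # number of characters
print(len(latin1_bytes))  # one byte per character in Latin-1
print(len(utf8_bytes))    # two bytes per character for these in UTF-8
```

The text is the same in both cases, but the bytes are not, which is why the sender and the receiver have to agree on the encoding.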

Second, you need to be able to work with the string. This isn't mechanize-specific, but covering a few basics should be useful. myString ends up being a unicode object (because the literal was declared with the u'' prefix). So, to write it out as Latin-1 bytes, you'll need to call .encode(), for example:

with open('test.txt', 'w') as f:
    f.write(myString.encode('latin-1'))
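
As an alternative sketch, io.open (available since Python 2.6 and identical to the built-in open on Python 3) accepts an encoding argument and does the encoding for you when you write a unicode object. The temporary path below is only for illustration.

```python
import io
import os
import tempfile

my_string = u'\xe9\xc0\xee'  # same text as u'éÀî'

# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Text mode with an explicit encoding: write unicode, get Latin-1 bytes on disk.
with io.open(path, 'w', encoding='latin-1') as f:
    f.write(my_string)

# Read the file back as raw bytes to see what was actually stored.
with io.open(path, 'rb') as f:
    raw = f.read()

# Read it back as text, letting io.open decode it again.
with io.open(path, 'r', encoding='latin-1') as f:
    round_tripped = f.read()
```

This gives the same bytes as calling .encode('latin-1') by hand; which form to use is mostly a matter of taste.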

And finally, when reading in a string that is encoded (say, from the remote web site), you can use .decode() to decode it into a unicode object, and work with it from there.

with open('test.txt', 'r') as f:
    myString = f.read().decode('latin-1')
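
Applied to the csv case from the question, the same decode-then-encode idea might look like the sketch below. It makes several assumptions: the csv file is saved as Latin-1, the site expects UTF-8 (the question says the site doesn't declare an encoding, so this is a guess worth testing), and the email address is a placeholder I made up. The � character is what typically appears when Latin-1 bytes are interpreted as UTF-8, which the sketch also reproduces.

```python
# One raw csv line as bytes, assuming the file is saved as Latin-1.
# The email address is a made-up placeholder.
raw_line = b'Adam Gu\xe9rin;adam@example.com;555-5555;2011-02-05'

# Split on the separator, then decode every cell into a unicode object.
cells = raw_line.split(b';')
name, email, phone, date = [cell.decode('latin-1') for cell in cells]

# Reading the same bytes as UTF-8 fails on the accent; with 'replace'
# the bad byte becomes U+FFFD, the diamond with a question mark.
garbled = cells[0].decode('utf-8', 'replace')

# Re-encode for the site, assuming (not confirmed) that it expects UTF-8.
name_for_form = name.encode('utf-8')

print(repr(garbled))
print(repr(name_for_form))
```

With mechanize you would then assign the re-encoded value, for example self.br.form['fldName'] = name_for_form. If the accents still come out wrong, try the site's actual encoding in place of 'utf-8'.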
voithos, answered Oct 15 '22 12:10