 

How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

I am trying to build a string to use with the Google Geocoding API. I've checked a lot of threads, but I am still facing a problem and I don't understand how to solve it.

I need addresse1 to be a string without any special characters. For example, addresse1 is: "32 rue d'Athènes Paris France".

addresse1= collect.replace(' ','+').replace('\n','') 
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore') 

Here I expected a string without any accents... Oh no, it is not a string but a bytes object. So I did what was suggested and decoded it:

addresse1=addresse1.decode('utf-8') 

But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? What am I misunderstanding about Unicode? Or is there a better solution?

Thanks,

Stéphane.

Sulot asked Oct 25 '15 10:10


2 Answers

With the third-party package unidecode:

>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
Ignacio Vazquez-Abrams answered Nov 10 '22 03:11


addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
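
A minimal sketch of the difference, using the address from the question (the variable names here are only illustrative):

```python
import unicodedata

address = "32 rue d'Athènes Paris France"

# NFKD splits "è" into a plain "e" followed by a combining grave accent (U+0300).
decomposed = unicodedata.normalize('NFKD', address)

# UTF-8 can represent every character, so nothing is dropped and the
# round trip gives back the decomposed string unchanged.
utf8_round_trip = decomposed.encode('utf-8', 'ignore').decode('utf-8')

# ASCII cannot represent the combining accent, so 'ignore' silently drops it.
ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')

print(utf8_round_trip == decomposed)
print(ascii_only)
```

Note that the ASCII variant also drops any character that has no ASCII base letter at all, not just accents.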

is there a better solution?

It depends what you are trying to do.

If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
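
Something along these lines should do it. This is a sketch, and strip_diacritics is a made-up helper name, not a standard library function:

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks but keep every other character."""
    decomposed = unicodedata.normalize('NFKD', text)
    # The mark categories are Mn, Mc and Me, so test the first letter.
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith('M')]
    return ''.join(kept)


print(strip_diacritics("32 rue d'Athènes Paris France"))
# Letters without a decomposition, such as "ø", pass through untouched.
print(strip_diacritics("Ærøskøbing café"))
```

Be aware that NFKD also applies compatibility decompositions (ligatures, for instance, are split into separate letters), which may or may not be what you want.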

If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
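
As a rough illustration of what "custom replacements" could look like, here is a sketch with a small German table. Both the table and the transliterate helper are assumptions made up for this example, and the table is certainly not complete:

```python
import unicodedata

# Hypothetical, incomplete replacement table for German only.
GERMAN_REPLACEMENTS = {
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
}


def transliterate(text, replacements):
    """Replace each character found in the table, leave the rest alone."""
    # Compose first so "o" + combining diaeresis matches the "ö" key.
    composed = unicodedata.normalize('NFC', text)
    return ''.join(replacements.get(ch, ch) for ch in composed)


print(transliterate("Köln", GERMAN_REPLACEMENTS))
```

A Swedish table would simply not contain the ö entry, or would map it to something else, which is why this can't be solved with one universal function.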

If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.

The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.

ETA:

url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...

Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode: the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:

params = dict(
    address=addresse1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)

(In Python 3 it's urllib.parse.quote and urllib.parse.urlencode, and they default to UTF-8, so you don't have to encode manually there.)
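
A sketch of the Python 3 version, building only the query string. 'YOUR_API_KEY' is a placeholder and the variable names are illustrative:

```python
from urllib.parse import quote, urlencode

address = "32 rue d'Athènes Paris France"

# A list of pairs keeps the parameter order predictable.
params = [
    ('address', address),
    ('key', 'YOUR_API_KEY'),  # placeholder, not a real key
]

# urlencode percent-encodes each value as UTF-8 and turns spaces into "+".
query = urlencode(params)
print(query)

# quote handles a single string; spaces become "%20" instead.
print(quote(address))
```

The resulting query goes after the "?" of the endpoint URL. There is no need to replace spaces with "+" or strip accents by hand first.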

data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)

From Python 3.6 onward json.loads accepts byte strings (it detects UTF-8, UTF-16 or UTF-32), so there you can omit the UTF-8 decode; on 3.5 you still need it. Likewise json.load reads directly from a file-like object, so on 3.6+ you don't have to load the data into a string at all:

data3 = json.load(urllib.request.urlopen(url2))
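
To try this without hitting the network, here is a sketch that uses io.BytesIO as a stand-in for the HTTP response object. It assumes Python 3.6 or later, and the JSON payload is invented purely for the example (it is not real Geocoding API output):

```python
import io
import json

# Made-up payload, encoded to bytes the way an HTTP body would arrive.
raw_body = '{"status": "OK", "name": "Athènes"}'.encode('utf-8')

# BytesIO offers the same read() interface as the urlopen() response.
fake_response = io.BytesIO(raw_body)

# json.load reads the bytes and decodes them as UTF-8 on its own.
data = json.load(fake_response)
print(data['name'])
```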
bobince answered Nov 10 '22 03:11