I'm writing a little Python script that parses word docs and writes to a csv file. However, some of the docs have some utf-8 characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii')
to the csv file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you
You can use the Unidecode
package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
Edit: a comment points out if your language isn't English, you may find ASCII to be too restrictive. Here's an adaptation of the above code that uses a whitelist to indicate characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
You can't assign to a string, as they are immutable, and can't be changed.
You can, however, just use the regex library, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c','"',line)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With