I'm working with some text in Python. It's already in Unicode format internally, but I would like to get rid of some special characters and replace them with more standard versions.
I currently have a line that looks like this, but it's getting ever more complex and I can see it will eventually cause trouble.
tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")
For example, \u2018 and \u2019 are the left and right single quotation marks. Those are somewhat acceptable, but I don't think they are needed for this type of text processing.
Things like \u2013 (EN DASH) and \u2026 (HORIZONTAL ELLIPSIS) are definitely not needed.
Is there a way to replace those quotation marks with simple standard quotes that won't break text processing with nltk, and to remove things like EN DASH and HORIZONTAL ELLIPSIS, without making a monster call like the one starting to rear its head in the sample code above?
If your text is in English and you want to clean it up in a human-readable way, use the third-party module unidecode. It replaces a wide range of characters with their nearest ASCII look-alike. Just apply unidecode.unidecode() to any string to make the substitutions:
from unidecode import unidecode
clean = unidecode(u'Some text: \u2018\u2019\u2013\u03a9')
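If you would rather keep control over exactly which characters are replaced or dropped, as in the chained replace() call from the question, a standard-library alternative is a translation table passed to the string's translate() method. The following is only a sketch under stated assumptions: the names REPLACEMENTS and clean_text are hypothetical, and the table covers just the four characters mentioned in the question.

```python
# Hypothetical mapping (not from the original post): code point -> replacement.
# Mapping a code point to None deletes that character.
REPLACEMENTS = {
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK  -> plain apostrophe
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK -> plain apostrophe
    0x2013: None,   # EN DASH                     -> removed
    0x2026: None,   # HORIZONTAL ELLIPSIS         -> removed
}


def clean_text(text):
    """Lowercase a Unicode string and apply all replacements in one pass."""
    return text.lower().translate(REPLACEMENTS)


sample = u"It\u2019s \u2018Fine\u2019 \u2013 really\u2026"
print(clean_text(sample))
```

Adding another character then means adding one entry to the table rather than one more .replace() to the chain. Unlike unidecode, this leaves every character that is not in the table untouched.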