Python removing extra special unicode characters

Question

I'm working with some text in Python. It's already in Unicode format internally, but I would like to get rid of some special characters and replace them with more standard versions.

I currently have a line that looks like this, but it's getting ever more complex and I can see it will eventually cause more trouble.

tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")

For example, \u2018 and \u2019 are the left and right single quotation marks. Those are somewhat acceptable, but for this type of text processing I don't think they are needed.

Things like \u2013 (EN DASH) and \u2026 (HORIZONTAL ELLIPSIS) are definitely not needed.

Is there a way to replace those quotation marks with simple standard quotes that won't break text processing with nltk, and to remove things like the EN DASH and HORIZONTAL ELLIPSIS, without making a monster call like the one starting to rear its head in the sample code above?

alexis · Accepted Answer

If your text is in English and you want to clean it up in a human-readable way, use the third-party module unidecode. It replaces a wide range of characters with their nearest ASCII look-alike. Just apply unidecode.unidecode() to any string to make the substitutions:

from unidecode import unidecode
clean = unidecode(u'Some text: \u2018\u2019\u2013\u03a9')
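
If you would rather keep control over exactly which characters change (for instance, dropping the en dash entirely as in the question, instead of turning it into a hyphen), a translation table from the standard library is one alternative to the chained replace() calls. This is only a sketch, not part of unidecode: it assumes Python 3, and the names REPLACEMENTS, TABLE and clean_text are made up for illustration.

```python
# Hypothetical mapping; extend it with whatever characters show up in your text.
REPLACEMENTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK -> plain apostrophe
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK -> plain apostrophe
    "\u2013": "",   # EN DASH -> removed, as in the question
    "\u2026": "",   # HORIZONTAL ELLIPSIS -> removed, as in the question
}

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans(REPLACEMENTS)


def clean_text(text):
    """Lowercase the text and apply every replacement in one pass."""
    return text.lower().translate(TABLE)


print(clean_text("It\u2019s 1990\u20132000\u2026"))  # prints: it's 19902000
```

Adding a new character then means adding one entry to the dictionary instead of another .replace() in the chain.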

Tags:

python

unicode

text-processing

special-characters

nltk

user1610950
