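Most NLTK documentation and examples are devoted to lemmatization and stemming, but very little covers other normalization tasks such as: converting all letters to lower or upper case, removing punctuation, converting numbers into words, removing accent marks and other diacritics, expanding abbreviations, removing stopwords or "too common" words, and text canonicalization (tumor = tumour, it's = it is).
Please point me to where in NLTK I should dig. Any NLTK equivalents (Java or any other language) for the aforementioned purposes are welcome. Thanks.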
Update: I have written a Python library for text normalization for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.
In the NLTK documentation, too, a lot of (sub-)tasks are solved using pure Python methods.
a) converting all letters to lower or upper case
    text = 'aiUOd'
    print(text.lower())  # aiuod
    print(text.upper())  # AIUOD
b) removing punctuation
    text = 'She? Hm, why not!'
    puncts = '.,?!'  # every symbol to strip, comma included
    for sym in puncts:
        text = text.replace(sym, '')
    print(text)  # She Hm why not
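A variant of b) that avoids listing the symbols by hand. This is only a minimal sketch, assuming the ASCII set in the standard library's `string.punctuation` is what you want removed; the helper name `strip_punctuation` is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation('She? Hm, why not!'))  # She Hm why not
```

Note that apostrophes count as punctuation here, so "it's" becomes "its"; expand contractions first if that matters for your data.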
c) converting numbers into words
Here it is not as easy to write a few-liner, but plenty of existing solutions (code snippets, libraries, etc.) turn up if you search for them.
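To give an idea of what such a solution looks like, here is a rough sketch for English that only covers integers from 0 to 999. The names `number_to_words` and `spell_out_numbers` are invented for this example, and a real library would handle far more cases (larger numbers, decimals, ordinals, other languages).

```python
import re

_ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
         'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
         'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
_TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
         'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in the range 0..999 (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError('only 0..999 is supported in this sketch')
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ('-' + _ONES[ones] if ones else '')
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + ' hundred'
    return words + (' ' + number_to_words(rest) if rest else '')

def spell_out_numbers(text):
    """Replace each standalone run of 1 to 3 digits with its spelling."""
    return re.sub(r'\b\d{1,3}\b',
                  lambda m: number_to_words(int(m.group())), text)

print(spell_out_numbers('I have 2 cats and 42 books'))
# I have two cats and forty-two books
```

Longer digit runs such as years are left untouched, and decimals like 3.5 are not treated as one number, so this needs extending before use on real text.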
d) removing accent marks and other diacritics
See point b): build the list of accented characters the same way as puncts, but replace each one with its unaccented base letter instead of deleting it.
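An alternative for d) that needs no hand-built list is Unicode decomposition from the standard `unicodedata` module. This is a sketch, assuming that dropping every combining mark is acceptable for your language; `strip_diacritics` is a made-up helper name.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    # NFD splits 'é' into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep only characters that are not combining marks.
    stripped = ''.join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

print(strip_diacritics('café naïve Ångström'))  # cafe naive Angstrom
```

Letters that have no decomposition, such as 'ø' or 'ß', pass through unchanged and would still need an explicit mapping.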
e) expanding abbreviations
Create a dictionary with abbreviations:
    text = 'USA and GB are ...'
    abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
    for abbrev, expansion in abbrevs.items():
        text = text.replace(abbrev, expansion)
    print(text)  # United States and Great Britain are ...
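One thing to watch with plain `str.replace` in e) is that it also rewrites abbreviations that occur inside longer tokens (for example the "GB" in "16GB"). A possible refinement, sketched here with the same sample dictionary and a made-up helper name `expand_abbreviations`, is to match on word boundaries with `re`:

```python
import re

ABBREVS = {'USA': 'United States', 'GB': 'Great Britain'}  # sample data

# One alternation, longest key first, anchored on word boundaries.
_KEYS = sorted(ABBREVS, key=len, reverse=True)
_PATTERN = re.compile(r'\b(' + '|'.join(re.escape(k) for k in _KEYS) + r')\b')

def expand_abbreviations(text):
    """Expand whole-word abbreviations only (hypothetical helper)."""
    return _PATTERN.sub(lambda m: ABBREVS[m.group(1)], text)

print(expand_abbreviations('USA and GB are ...'))
# United States and Great Britain are ...
print(expand_abbreviations('USAGE of 16GB'))
# USAGE of 16GB
```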
f) removing stopwords or "too common" words
Create a list with stopwords:
    text = 'Mary had a little lamb'
    temp_corpus = text.split(' ')
    stops = ['a', 'the', 'had']
    corpus = [token for token in temp_corpus if token not in stops]
    print(corpus)  # ['Mary', 'little', 'lamb']
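On real text, f) usually also has to cope with capitalised stopwords and with punctuation glued to tokens. A small sketch of one way to do that, with a made-up helper `remove_stopwords` and the same toy stopword list; NLTK also ships stopword lists in `nltk.corpus.stopwords` (once its 'stopwords' data is downloaded), which could replace the hand-written set.

```python
import re

STOPS = {'a', 'the', 'had'}  # sample stopword set; a set gives fast lookups

def remove_stopwords(text, stops=STOPS):
    """Tokenize on word characters and drop stopwords, ignoring case."""
    tokens = re.findall(r"[\w']+", text)
    return [tok for tok in tokens if tok.lower() not in stops]

print(remove_stopwords('The lamb had a little bell.'))
# ['lamb', 'little', 'bell']
```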
g) text canonicalization (tumor = tumour, it's = it is)
For tumor -> tumour, use a regex. Contractions such as it's -> it is can be handled with a dictionary, as in e).
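As a rough illustration of g), the sketch below combines a regex for a handful of "-or" to "-our" spellings with a small contraction dictionary. The word stems, the dictionary entries and the helper name `canonicalize` are all invented for the example, and real spelling variation needs a much larger, curated rule set.

```python
import re

# Illustrative rules only: a few US "-or" spellings mapped to "-our".
_SPELLING = re.compile(r'\b(tum|col|fav|hon)or(s?)\b', re.IGNORECASE)

# Sample contractions; matching below is case-sensitive.
CONTRACTIONS = {"it's": 'it is', "don't": 'do not', "can't": 'cannot'}

def canonicalize(text):
    """Apply the spelling rule, then expand contractions (hypothetical helper)."""
    text = _SPELLING.sub(r'\1our\2', text)
    for short, full in CONTRACTIONS.items():
        text = re.sub(r'\b' + re.escape(short) + r'\b', full, text)
    return text

print(canonicalize("it's a benign tumor"))  # it is a benign tumour
```

Keep in mind that "it's" can also stand for "it has", so a plain dictionary lookup will sometimes pick the wrong expansion.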
Last but not least, please note that all of the examples above usually need tuning on real texts; I wrote them only as a direction to go in.