 

Are there any classes in NLTK for text normalizing and canonizing?


Most NLTK documentation and examples are devoted to lemmatization and stemming, but they say very little about normalization matters such as:

  • converting all letters to lower or upper case
  • removing punctuation
  • converting numbers into words
  • removing accent marks and other diacritics
  • expanding abbreviations
  • removing stopwords or "too common" words
  • text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK I should dig. Equivalents outside NLTK (in Java or any other language) for the purposes above are also welcome. Thanks.

UPD. I have written a Python library for text normalization for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.

soshial asked Feb 10 '12 12:02


1 Answer

Even in NLTK-based code, many of these (sub-)tasks are solved using plain Python methods.

a) converting all letters to lower or upper case

    text = 'aiUOd'
    print(text.lower())  # aiuod
    print(text.upper())  # AIUOD

b) removing punctuation

    text = 'She? Hm, why not!'
    puncts = '.,?!'  # the comma must be listed too, otherwise it stays in the text
    for sym in puncts:
        text = text.replace(sym, ' ')
    print(text)  # She  Hm  why not
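The loop above calls `replace` once per symbol. As a possible alternative, here is a minimal sketch that uses `str.translate` with the standard library's `string.punctuation` (ASCII punctuation only). The function name `strip_punctuation` and its `replacement` parameter are made up for this illustration.

```python
import string


def strip_punctuation(text, replacement=" "):
    """Replace every ASCII punctuation character with `replacement`."""
    # Build a character -> replacement table once; translate() applies it in one pass.
    table = str.maketrans({ch: replacement for ch in string.punctuation})
    return text.translate(table)


print(strip_punctuation("She? Hm, why not!"))
```

Note that `string.punctuation` also contains the apostrophe and the hyphen, so words like "it's" or "well-known" get split; trim the character set if that is not what you want.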

c) converting numbers into words

Here it is not that easy to write a few-liner, but there are plenty of existing solutions (code snippets, libraries, etc.) if you search for them.
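To give an idea of the size of the task, here is a rough sketch that spells out English integers from 0 to 999 only. The names `number_to_words` and `spell_out_numbers` are hypothetical, and the sketch assumes standalone digit runs; larger numbers, decimals, ordinals and dates are deliberately left out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 (assumed limit of this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def spell_out_numbers(text):
    """Replace standalone runs of 1 to 3 digits with their spelled-out form."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(spell_out_numbers("I have 2 cats and 42 books"))
```

Longer digit runs are left untouched, but a decimal such as 3.14 would be treated as two separate numbers, which is one reason a dedicated library is usually the better choice.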

d) removing accent marks and other diacritics

Take the same approach as in point b), but map each accented character to its base letter (for example 'é' to 'e') instead of replacing it with a space.
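Instead of listing every accented character by hand, one option is Unicode decomposition from the standard library's `unicodedata` module. The sketch below is only one way to do it, and the function name `strip_diacritics` is made up for the example.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after decomposing characters (NFD normalization)."""
    # NFD splits e.g. 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks, which we skip.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("café naïve Zürich"))
```

Letters that have no decomposition, such as 'ø' or 'ł', pass through unchanged, so they still need an explicit mapping if they matter for your texts.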

e) expanding abbreviations

Create a dictionary with abbreviations:

    text = 'USA and GB are ...'
    abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
    for abbrev in abbrevs:
        text = text.replace(abbrev, abbrevs[abbrev])
    print(text)  # United States and Great Britain are ...
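Plain `replace` also rewrites abbreviations that occur inside longer tokens (for instance 'GB' inside 'GBP'). A possible refinement, sketched here with hypothetical names `ABBREVS` and `expand_abbreviations`, is to match whole words only with a regular expression.

```python
import re

# Example dictionary; in practice this would come from your own domain.
ABBREVS = {"USA": "United States", "GB": "Great Britain"}


def expand_abbreviations(text, abbrevs=ABBREVS):
    """Expand abbreviations only where they appear as whole words."""
    # Longest keys first, so a shorter key never shadows a longer one.
    keys = sorted(abbrevs, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: abbrevs[m.group(1)], text)


print(expand_abbreviations("USA and GB are ..."))
```

The match is case-sensitive on purpose, since many abbreviations differ from ordinary words only by their capitalization.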

f) removing stopwords or "too common" words

Create a list with stopwords:

    text = 'Mary had a little lamb'
    temp_corpus = text.split(' ')
    stops = ['a', 'the', 'had']
    corpus = [token for token in temp_corpus if token not in stops]
    print(corpus)  # ['Mary', 'little', 'lamb']
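NLTK itself ships ready-made stopword lists in `nltk.corpus.stopwords` (they require a one-time download of the corpus data). To stay self-contained, the sketch below keeps a hand-written set instead and only adds case-insensitive matching; `STOPWORDS` and `remove_stopwords` are names invented for this example.

```python
# A tiny hand-written stopword set; replace it with a fuller list as needed.
STOPWORDS = {"a", "the", "had"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stopword."""
    return [token for token in text.split() if token.lower() not in stopwords]


print(remove_stopwords("The lamb Mary had"))
```

Using a set rather than a list keeps the membership test cheap when the stopword list grows to a few hundred entries.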

g) text canonicalization (tumor = tumour, it's = it is)

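For tumor -> tumour, use a regular expression; contractions such as it's -> it is can be handled with a dictionary, as in point e).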
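A minimal sketch of that idea, assuming a small hand-made mapping: the dictionaries `SPELLING_VARIANTS` and `CONTRACTIONS` and the function `canonicalize` are illustrative names, not part of NLTK, and the entries are examples only.

```python
import re

# Illustrative mappings; real texts need much larger, curated tables.
SPELLING_VARIANTS = {"tumor": "tumour", "color": "colour"}
CONTRACTIONS = {"it's": "it is", "don't": "do not"}


def canonicalize(text):
    """Rewrite known variants and contractions to one canonical form."""
    mapping = {**SPELLING_VARIANTS, **CONTRACTIONS}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b",
        re.IGNORECASE,
    )
    # Look up the lowercase form; the replacement is always lowercase.
    return pattern.sub(lambda m: mapping[m.group(1).lower()], text)


print(canonicalize("It's a tumor"))
```

This sketch ignores inflected forms such as 'tumors', loses the original capitalization, and treats "it's" as "it is" even where "it has" is meant, so it needs tuning before use on real data.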

Last but not least, please note that all of the examples above usually need tuning on real texts; I wrote them only to show the direction to go.

Max Li answered Dec 04 '22 03:12