
Removing non-English words from text using Python

I am doing a data cleaning exercise in Python, and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online to find out whether this can be done in Python with a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico." 

I would like to be left with:

"to the beach with my"  

Does anyone know how this could be done? Any help would be much appreciated.

Andre Croucher asked Dec 22 '16 19:12



2 Answers

You can use the words corpus from NLTK:

    import nltk
    words = set(nltk.corpus.words.words())

    sent = "Io andiamo to the beach with my amico."
    " ".join(w for w in nltk.wordpunct_tokenize(sent)
             if w.lower() in words or not w.isalpha())
    # 'Io to the beach with my .'

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
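One way to handle false positives such as "Io" is to pass an explicit exclusion set alongside the vocabulary. The following is only a minimal sketch, not part of the original answer: the helper name keep_known_words, the exclude parameter, and the tiny stand-in vocabulary are all assumptions made for illustration, and the regular expression is meant to mirror the splitting rule of nltk.wordpunct_tokenize so the example runs without NLTK installed.

```python
import re

# Intended to mirror nltk.wordpunct_tokenize: runs of word characters,
# or runs of punctuation (an assumption made so this sketch needs no NLTK).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary, exclude=frozenset()):
    """Keep tokens found in `vocabulary` (minus `exclude`) and non-alphabetic tokens.

    `keep_known_words` and `exclude` are hypothetical names for this sketch.
    """
    kept = []
    for token in TOKEN_RE.findall(text):
        lowered = token.lower()
        is_known = lowered in vocabulary and lowered not in exclude
        # Punctuation and numbers pass through, as in the answer above.
        if is_known or not token.isalpha():
            kept.append(token)
    return " ".join(kept)


# Tiny stand-in vocabulary (illustrative only); in practice this would be
# set(nltk.corpus.words.words()).
vocabulary = {"io", "to", "the", "beach", "with", "my"}
sent = "Io andiamo to the beach with my amico."

print(keep_known_words(sent, vocabulary))                  # Io to the beach with my .
print(keep_known_words(sent, vocabulary, exclude={"io"}))  # to the beach with my .
```

The exclusion set has to be maintained by hand, so this only helps for a known list of problem words; it does not solve the general question of deciding whether a word is English.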

DYZ answered Oct 03 '22 17:10



On macOS, the code in the previous answer can still raise an exception, because NLTK does not download the words corpus automatically. After importing nltk, download the words corpus explicitly; otherwise the lookup will fail.

    import nltk

    nltk.download('words')
    words = set(nltk.corpus.words.words())

Now you can run the same filtering step as in the previous answer:

    sent = "Io andiamo to the beach with my amico."
    sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())

The NLTK documentation doesn't mention this, but I ran into the problem, found an issue about it on GitHub, and solved it this way; it really works. If you call nltk.download() without the 'words' argument, macOS can log you out, and this may happen again and again.
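To avoid calling the downloader on every run, the download step can be guarded so it only fires when the corpus is missing. This is a rough sketch under stated assumptions: ensure_corpus is a hypothetical helper name, and the find and download parameters exist only so the logic can be exercised without NLTK; when they are omitted the helper falls back to nltk.data.find and nltk.download, with 'corpora/words' assumed to be the resource path of the words corpus.

```python
def ensure_corpus(resource_path, package, find=None, download=None):
    """Download `package` only if `resource_path` is not already installed.

    Returns True if a download was triggered, False otherwise.
    `ensure_corpus` is a hypothetical helper, not part of NLTK.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tried without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # nltk.data.find raises LookupError when missing
        return False
    except LookupError:
        # Always pass the package name explicitly, as the answer recommends.
        download(package)
        return True


# Intended usage with NLTK installed (not executed here):
#   ensure_corpus("corpora/words", "words")
#   words = set(nltk.corpus.words.words())
```

Passing the package name explicitly keeps the call non-interactive, which is the behavior the answer relies on.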

gdmanandamohon answered Oct 03 '22 17:10
