 

Get rid of stopwords and punctuation

I'm struggling with NLTK stopwords.

Here's my bit of code. Could someone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords(palabras):
    return [word for word in palabras if word not in stopwords.words('spanish')]

palabras = ''' my text is here '''
asked by Cappai, Apr 04 '11

People also ask

How do you remove Stopwords and punctuation in Python?

In order to remove stopwords and punctuation using NLTK, we first have to download the stop word lists using nltk.download('stopwords'). We then specify the language whose stopwords we want to remove; for example, we use stopwords.words('english') to get the English list and save it to a variable.

How do you remove Stopwords in a sentence?

To remove stop words from a sentence, you can divide your text into words and then drop each word if it exists in the list of stop words provided by NLTK. In the script above, we first import the stopwords collection from the nltk.corpus module.

Should Stopwords be removed?

Stop words are available in abundance in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information.

How do you remove punctuation with NLTK?

The workflow assumed by NLTK is that you first tokenize text into sentences and then every sentence into words. That is why word_tokenize() does not work well with multiple sentences. To get rid of the punctuation, you can use a regular expression or Python's isalnum() function.
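As a minimal sketch of the isalnum() route described above (the token list is hand-written here to stand in for the output of a tokenizer, so no NLTK download is needed):

```python
# Hand-written tokens standing in for nltk.word_tokenize() output,
# which would emit punctuation as separate tokens.
tokens = ['Hola', ',', 'mundo', '!', 'adios']

# Keep only tokens made entirely of letters/digits, dropping punctuation.
words = [t for t in tokens if t.isalnum()]
print(words)  # ['Hola', 'mundo', 'adios']
```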


2 Answers

Your problem is that the iterator for a string returns each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

You need to iterate over and check each word. Fortunately, the split method already exists on Python strings. However, since you are dealing with natural language, which includes punctuation, you should consider the re module for a more robust answer.
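A quick illustration of the difference the paragraph above is pointing at: str.split() leaves punctuation attached to words, while a \w+ regular expression extracts only the word characters.

```python
import re

sentence = 'Buenos dias, mundo.'

# str.split() keeps punctuation glued to the adjacent words...
print(sentence.split())              # ['Buenos', 'dias,', 'mundo.']

# ...while re.findall(r'\w+', ...) strips it out.
print(re.findall(r'\w+', sentence))  # ['Buenos', 'dias', 'mundo']
```

A word like 'dias,' would never match an entry in the stopword list, which is why the regex approach matters here.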

Once you have a list of words, you should lowercase them all before comparison and then compare them in the manner you have already shown.

Good luck.

EDIT 1

Okay, try this code; it should work for you. It shows two ways to do it. They are essentially identical, but the first is a bit clearer while the second is more Pythonic.

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words
# (\w+ matches Unicode word characters by default in Python 3)
words = re.findall(r'\w+', sentence)

# Build the stopword list once as a set for fast membership tests
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more Pythonic way
important_words = [word for word in words if word not in spanish_stopwords]

print(important_words)

I hope this helps you.

answered by JHSaunders, Nov 12 '22

Using a tokenizer first, you compare a list of tokens (symbols) against the stoplist, so you don't need the re module. I added an extra argument in order to switch among languages.

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [token for token in nltk.word_tokenize(sentence)
            if token.lower() not in stopwords.words(language)]
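To sanity-check the same token-filter logic without downloading the NLTK corpora, here is a stand-in sketch: str.split() replaces nltk.word_tokenize(), and the stopword list is a tiny illustrative subset, not NLTK's actual Spanish list.

```python
# Illustrative stand-in for stopwords.words('spanish'); not the real list.
SPANISH_STOPWORDS = {'el', 'la', 'de', 'que', 'y', 'es'}

def remove_stopwords(sentence, stoplist):
    # Same token-filter pattern as above, with str.split()
    # standing in for nltk.word_tokenize().
    return [token for token in sentence.split()
            if token.lower() not in stoplist]

print(remove_stopwords('El problema es la rutina', SPANISH_STOPWORDS))
# ['problema', 'rutina']
```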

Let me know if it was useful to you ;)

answered by alemol, Nov 12 '22