Get rid of stopwords and punctuation

I'm struggling with NLTK stopwords.

Here's my bit of code. Could someone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''

asked Apr 04 '11 16:04

Cappai

2 Answers

Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

You need to iterate over the words and check each one. Fortunately, Python strings already have a built-in split method (str.split). However, since you are dealing with natural language that includes punctuation, a more robust approach is to use the re module.

Once you have a list of words, you should lowercase them all before comparison and then compare them in the manner that you have shown already.
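As a rough sketch of those steps (split into words, lowercase, then filter), something like the following should work. Note that SPANISH_STOPWORDS is a small hand-written stand-in I made up so the snippet runs without NLTK data; in real code you would use stopwords.words('spanish') instead.

```python
import re

# Hypothetical stand-in for stopwords.words('spanish'); the real list is much longer.
SPANISH_STOPWORDS = {"el", "la", "de", "que", "y", "a", "es", "se", "del", "las"}

def remove_stopwords(text, stoplist=SPANISH_STOPWORDS):
    """Lowercase the text, split it into words, and drop stopwords."""
    # \w+ matches runs of letters/digits, so punctuation is discarded
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stoplist]

# A plain whitespace split leaves punctuation attached to the words
print("hola, mundo".split())  # prints ['hola,', 'mundo']

print(remove_stopwords("El problema del matrimonio es que se acaba."))
# prints ['problema', 'matrimonio', 'acaba']
```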

Good luck.

EDIT 1

Okay, try this code; it should work for you. It shows two ways to do it. They are essentially identical, but the first is a bit clearer while the second is more pythonic.

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words
# (in Python 3, \w is Unicode-aware by default for str patterns)
words = re.findall(r'\w+', sentence)

# Load the stopword list once, as a set, for fast lookups
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more pythonic way
important_words = list(filter(lambda x: x not in spanish_stopwords, words))

print(important_words)
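One thing worth considering if you process many sentences: calling stopwords.words('spanish') for every word reloads the list each time. A possible sketch is to build the set once and reuse it. Here make_stopword_filter is a hypothetical helper name of mine, and the short list passed to it is just a stand-in for the real NLTK list.

```python
def make_stopword_filter(stoplist):
    """Return a function that removes the given stopwords from a word list."""
    # Built once; set membership tests are much faster than list scans
    stopset = frozenset(word.lower() for word in stoplist)

    def remove(words):
        return [word for word in words if word.lower() not in stopset]

    return remove

# Hypothetical stand-in; in practice pass stopwords.words('spanish') here.
remove_spanish = make_stopword_filter(["el", "de", "que", "y", "a", "hay"])

print(remove_spanish(["hay", "que", "volver", "a", "reconstruirlo"]))
# prints ['volver', 'reconstruirlo']
```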

I hope this helps you.

answered Nov 12 '22 06:11

JHSaunders

If you run a tokenizer first, you compare a list of tokens (symbols) against the stoplist, so you don't need the re module. I added an extra argument so you can switch between languages.

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [token for token in nltk.word_tokenize(sentence) if token.lower() not in stopwords.words(language)]
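Keep in mind that a tokenizer typically returns punctuation marks as tokens of their own, so they survive a pure stopword filter. If you also want to drop them, a minimal sketch could look like this. The function name and the hand-written token list are assumptions for illustration; the tokens only imitate what a tokenizer might return.

```python
def drop_stopwords_and_punctuation(tokens, stoplist):
    """Keep tokens that contain a letter or digit and are not stopwords."""
    stopset = set(stoplist)
    return [
        token for token in tokens
        if any(ch.isalnum() for ch in token) and token.lower() not in stopset
    ]

# Tokens as a tokenizer might return them (hand-written here, an assumption).
tokens = ["El", "amor", ",", "y", "el", "desayuno", "."]
print(drop_stopwords_and_punctuation(tokens, ["el", "y"]))
# prints ['amor', 'desayuno']
```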

Let me know if this was useful ;)

answered Nov 12 '22 04:11

alemol


Tags:

python

nltk

stop-words