
Python remove stop words from pandas dataframe

Tags: python, pandas

I want to remove the stop words from my "tweet" column. How do I iterate over each row and each item?

    import pandas as pd
    from nltk.corpus import stopwords

    pos_tweets = [('I love this car', 'positive'),
                  ('This view is amazing', 'positive'),
                  ('I feel great this morning', 'positive'),
                  ('I am so excited about the concert', 'positive'),
                  ('He is my best friend', 'positive')]

    test = pd.DataFrame(pos_tweets)
    test.columns = ["tweet", "class"]
    # Lowercase each tweet and split it into a list of tokens.
    test["tweet"] = test["tweet"].str.lower().str.split()

    stop = stopwords.words('english')
collarblind asked Apr 08 '15 19:04



1 Answer

We can import stopwords from nltk.corpus as shown below. We then exclude the stop words using a Python list comprehension together with pandas.Series.apply.

    import pandas as pd

    # Import stopwords with nltk.
    from nltk.corpus import stopwords
    stop = stopwords.words('english')

    pos_tweets = [('I love this car', 'positive'),
                  ('This view is amazing', 'positive'),
                  ('I feel great this morning', 'positive'),
                  ('I am so excited about the concert', 'positive'),
                  ('He is my best friend', 'positive')]

    test = pd.DataFrame(pos_tweets)
    test.columns = ["tweet", "class"]

    # Exclude stopwords with a list comprehension and pandas.Series.apply.
    # The comparison is case-sensitive, so capitalized words such as "I" are kept.
    test['tweet_without_stopwords'] = test['tweet'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in stop]))
    print(test)
    # Out[40]:
    #                                tweet     class tweet_without_stopwords
    # 0                    I love this car  positive              I love car
    # 1               This view is amazing  positive       This view amazing
    # 2          I feel great this morning  positive    I feel great morning
    # 3  I am so excited about the concert  positive       I excited concert
    # 4               He is my best friend  positive          He best friend
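In the question, the "tweet" column has already been lowercased and split into lists of tokens, so calling x.split() on each cell would fail there. A minimal sketch for that case follows. It assumes a tiny stand-in stop set (`stop_set` is a hypothetical name, not part of NLTK) so that it runs without downloading the corpus; in practice it would likely be set(stopwords.words('english')).

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')): a tiny subset
# so the example runs without the NLTK corpus being installed.
stop_set = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

test = pd.DataFrame(
    [("I love this car", "positive"), ("He is my best friend", "positive")],
    columns=["tweet", "class"],
)

# Same preprocessing as in the question: lowercase, then split into tokens.
test["tweet"] = test["tweet"].str.lower().str.split()

# Each cell is now a list, so filter the list directly instead of calling split().
test["tweet_without_stopwords"] = test["tweet"].apply(
    lambda words: [word for word in words if word not in stop_set]
)

print(test["tweet_without_stopwords"].tolist())
```

Using a set rather than a list makes each membership check cheaper, which may matter on large frames. Because the tokens were lowercased first, "I" and "He" are removed here as well.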

Stop words can also be removed with pandas.Series.str.replace.

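    # Build one alternation pattern that matches any stop word as a whole word.
    pat = r'\b(?:{})\b'.format('|'.join(stop))
    # regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default.
    test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
    # Collapse the runs of whitespace left behind by the removed words.
    test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)
    # Same results.
    # 0              I love car
    # 1       This view amazing
    # 2    I feel great morning
    # 3       I excited concert
    # 4          He best friend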
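To make the pattern itself easier to inspect, here is a small sketch of the same idea using only the standard re module on plain strings. It departs from the answer in three labeled ways: the stop list is a short hypothetical stand-in, each word is passed through re.escape in case a word contains regex metacharacters, and re.IGNORECASE is used so capitalized stop words are removed too. `strip_stop_words` is an illustrative helper name, not a library function.

```python
import re

# Hypothetical stand-in for stopwords.words('english').
stop = ["i", "this", "is", "my", "he"]

# Alternation of escaped stop words, wrapped in word boundaries so that
# "is" does not match inside a longer word such as "This".
pat = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, stop))),
    flags=re.IGNORECASE,
)

def strip_stop_words(text):
    """Remove stop words, then collapse and trim the leftover whitespace."""
    without = pat.sub("", text)
    return re.sub(r"\s+", " ", without).strip()

print(strip_stop_words("This view is amazing"))
print(strip_stop_words("He is my best friend"))
```

The final strip() handles the case where a tweet begins or ends with a stop word, which would otherwise leave a stray space.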

If the stopwords corpus is not available locally, you can download it as follows.

    import nltk
    nltk.download('stopwords')

Another option is to use text.ENGLISH_STOP_WORDS from sklearn.feature_extraction instead of the NLTK list.

    # Import stopwords with scikit-learn
    from sklearn.feature_extraction import text
    stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and NLTK stop word lists contain different numbers of words, so the results may differ depending on which one you use.

Keiku answered Sep 29 '22 00:09