Python remove stop words from pandas dataframe




I want to remove the stop words from my column "tweet". How do I iterate over each row and each item?

    import pandas as pd
    from nltk.corpus import stopwords

    pos_tweets = [('I love this car', 'positive'),
                  ('This view is amazing', 'positive'),
                  ('I feel great this morning', 'positive'),
                  ('I am so excited about the concert', 'positive'),
                  ('He is my best friend', 'positive')]

    test = pd.DataFrame(pos_tweets)
    test.columns = ["tweet", "class"]
    test["tweet"] = test["tweet"].str.lower().str.split()

    stop = stopwords.words('english')
1 Answer

We can import stopwords from nltk.corpus as shown below. With that list, we exclude stop words using a Python list comprehension and pandas.Series.apply.

    import pandas as pd

    # Import stopwords with nltk.
    from nltk.corpus import stopwords
    stop = stopwords.words('english')

    pos_tweets = [('I love this car', 'positive'),
                  ('This view is amazing', 'positive'),
                  ('I feel great this morning', 'positive'),
                  ('I am so excited about the concert', 'positive'),
                  ('He is my best friend', 'positive')]

    test = pd.DataFrame(pos_tweets)
    test.columns = ["tweet", "class"]

    # Exclude stopwords with a list comprehension and pandas.Series.apply.
    # The comparison is case-sensitive, so capitalized words such as "I" are kept.
    test['tweet_without_stopwords'] = test['tweet'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in stop]))
    print(test)
    # Out[40]:
    #                                tweet     class tweet_without_stopwords
    # 0                    I love this car  positive              I love car
    # 1               This view is amazing  positive       This view amazing
    # 2          I feel great this morning  positive    I feel great morning
    # 3  I am so excited about the concert  positive       I excited concert
    # 4               He is my best friend  positive          He best friend
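The question's column already holds lists of lowercase tokens (after `.str.lower().str.split()`), whereas the code above works on raw strings. Here is a minimal sketch for the list case. It assumes a small hand-picked stop set as a stand-in for `stopwords.words('english')` so that it runs without the NLTK corpus; the name `remove_stop_words` is hypothetical.

```python
# Stand-in for set(stopwords.words('english')); an assumption made so the
# sketch runs without downloading the NLTK corpus.
STOP = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}


def remove_stop_words(tokens, stop=STOP):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop]


tweets = ["I love this car", "This view is amazing", "He is my best friend"]

# Mirror the question's preprocessing: lowercase, then split into tokens.
tokenized = [tweet.lower().split() for tweet in tweets]
filtered = [remove_stop_words(tokens) for tokens in tokenized]

for row in filtered:
    print(row)
# ['love', 'car']
# ['view', 'amazing']
# ['best', 'friend']
```

On the question's DataFrame, the same function could presumably be used as `test["tweet"].apply(remove_stop_words, stop=set(stop))`. Converting the NLTK list to a set makes each membership check a hash lookup rather than a scan of the whole list.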

Stop words can also be removed with pandas.Series.str.replace.

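    # Build one alternation pattern that matches any stop word as a whole word.
    pat = r'\b(?:{})\b'.format('|'.join(stop))
    # regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
    test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
    # Collapse the runs of spaces left behind by the removed words.
    test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)
    # Same results.
    # 0              I love car
    # 1       This view amazing
    # 2    I feel great morning
    # 3       I excited concert
    # 4          He best friend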
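The pattern above is case-sensitive, which is why "I", "This" and "He" survive in the output. The sketch below shows a case-insensitive variant using only the standard `re` module. It assumes a short stand-in list in place of `stopwords.words('english')`, and the helper name `strip_stop_words` is hypothetical.

```python
import re

# Stand-in for stopwords.words('english'); an assumption for illustration.
stop = ["i", "this", "is", "am", "so", "about", "the", "he", "my"]

# Escape each word so any regex metacharacters are matched literally,
# and ignore case so "This" and "this" are both removed.
pattern = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, stop))),
    flags=re.IGNORECASE,
)


def strip_stop_words(text):
    """Remove stop words, then collapse the leftover whitespace."""
    without = pattern.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


print(strip_stop_words("I love this car"))       # love car
print(strip_stop_words("He is my best friend"))  # best friend
```

With pandas, the same effect should be achievable by passing `flags=re.IGNORECASE` and `regex=True` to `Series.str.replace`, followed by `.str.strip()`.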

If you cannot import stopwords, download the corpus as follows.

    import nltk
    nltk.download('stopwords')

Another option is to use text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

    # Import stopwords with scikit-learn
    from sklearn.feature_extraction import text
    stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and NLTK stop word lists contain different numbers of words, so the two can give different results.

