Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stopword removal with NLTK and Pandas

I have some issues with Pandas and NLTK. I am new at programming, so excuse me if i ask questions that might be easy to solve. I have a csv file which has 3 columns(Id,Title,Body) and about 15.000 rows.

My goal is to remove the stopwords from this csv file. The operation for lowercase and split are working well. But i can not find my mistake why the stopwords does not get removed. What am i missing?

    import pandas as pd
    from nltk.corpus import stopwords

    pd.read_csv("test10in.csv", encoding="utf-8") 

    df = pd.read_csv("test10in.csv") 

    df.columns = ['Id','Title','Body']
    df['Title'] = df['Title'].str.lower().str.split()  
    df['Body'] = df['Body'].str.lower().str.split() 


    stop = stopwords.words('english')

    df['Title'].apply(lambda x: [item for item in x if item not in stop])
    df['Body'].apply(lambda x: [item for item in x if item not in stop])

    df.to_csv("test10out.csv")
like image 734
slm Avatar asked Oct 20 '15 19:10

slm


People also ask

How do I remove Stopwords from pandas?

We use Pandas apply with the lambda function and list comprehension to remove stop words declared in NLTK.

How do I remove Stopwords in NLTK?

Using Python's NLTK Library NLTK supports stop word removal, and you can find the list of stop words in the corpus module. To remove stop words from a sentence, you can divide your text into words and then remove the word if it exits in the list of stop words provided by NLTK.

How do I get rid of punctuation in pandas?

To remove punctuation with Python Pandas, we can use the DataFrame's str. replace method. We call replace with a regex string that matches all punctuation characters and replace them with empty strings. replace returns a new DataFrame column and we assign that to df['text'] .

How do you remove stop words and punctuations in Python?

In order to remove stopwords and punctuation using NLTK, we have to download all the stop words using nltk. download('stopwords'), then we have to specify the language for which we want to remove the stopwords, therefore, we use stopwords. words('english') to specify and save it to the variable.


1 Answers

you are trying to do an inplace replace. you should do

   df['Title'] = df['Title'].apply(lambda x: [item for item in x if item not in stop])
    df['Body'] = df['Body'].apply(lambda x: [item for item in x if item not in stop])
like image 85
AbtPst Avatar answered Sep 26 '22 02:09

AbtPst