I am trying to I am tring to delete stop words via spark,the code is as follow
from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]
wordlist=spark.createDataFrame([word_list]).rdd
def stopwords_delete(word_list):
filtered_words=[]
print word_list
for word in word_list:
print word
if word not in stopwords.words('english'):
filtered_words.append(word)
filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)
and I got the error as follow:
pickle.PicklingError: args[0] from newobj args has the wrong class
I don't know why,can somebody help me.
Thanks in advance
It's to do with uploading of stop words module. As a work around import stopwords library with in the function itself. please see the similar issue linked below. I had the same issue and this work around fixed the problem.
def stopwords_delete(word_list):
from nltk.corpus import stopwords
filtered_words=[]
print word_list
Similar Issue
I would recommend from pyspark.ml.feature import StopWordsRemover
as permanent fix.
Probably, it's just because you are defining the stopwords.words('english') every time on the executor. Define it outside and this would work.
You are using map over a rdd which has only one row and each word as a column.so, the entire row of rdd which is of type is passed to stopwords_delete fuction and in the for loop within that, is trying to match rdd to stopwords and it fails.Try like this,
filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)
I got this output as filtered_words,
["shan't", "she'd", 'fuck', 'world', "who's"]
Also, include a return in your function.
Another way, you could use list comprehension to replace the stopwords_delete fuction,
filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With