
pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

I am trying to delete stop words with Spark. The code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]

wordlist=spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words=[]
    print word_list



    for word in word_list:
        print word
        if word not in stopwords.words('english'):
            filtered_words.append(word)



filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)

and I got the following error:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I don't know why. Can somebody help me?
Thanks in advance.

Tiana asked Jul 04 '17 17:07


3 Answers

It has to do with how the stopwords module is shipped to the executors. As a workaround, import the stopwords library inside the function itself. Please see the similar issue linked below. I had the same issue and this workaround fixed the problem.

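    def stopwords_delete(word_list):
        # Import inside the function so the import happens on the executor
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words('english'))
        # Keep only the words that are not English stop words
        return [word for word in word_list if word not in stop_words]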

Similar Issue

As a permanent fix, I would recommend StopWordsRemover (from pyspark.ml.feature import StopWordsRemover).
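For reference, here is a minimal sketch of what the StopWordsRemover route could look like. The helper name remove_stop_words and the column names "words" and "filtered" are my own choices for illustration, not something from the question, and the imports sit inside the function so Spark is only needed when it is actually called.

```python
def remove_stop_words(spark, words):
    """Return `words` without Spark's default English stop words.

    `spark` is an existing SparkSession; `words` is a plain list of strings.
    """
    # Imported here so Spark is only required when the function is called.
    from pyspark.ml.feature import StopWordsRemover

    # StopWordsRemover expects a column holding an array of strings,
    # so the whole word list goes into a single cell of a one-row DataFrame.
    df = spark.createDataFrame([(words,)], ["words"])

    # With no stopWords argument, Spark falls back to its built-in English list.
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    return remover.transform(df).first()["filtered"]
```

With the session from the question this would be called as remove_stop_words(spark, word_list). Spark's built-in English list may not match NLTK's exactly, so the result can differ slightly from the NLTK-based versions.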

Shankar answered Sep 28 '22 09:09


Probably, it's just because you are calling stopwords.words('english') every time on the executor. Build the list once, outside the function, and this should work.
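A rough sketch of that idea, in plain Python so it runs without Spark or NLTK data: the stop words are turned into a plain set once, and the filtering function only closes over that set. STOP_WORDS here is a tiny made-up stand-in for stopwords.words('english'), and make_filter is a hypothetical helper name.

```python
# Stand-in for stopwords.words('english'): a small hypothetical subset,
# used only so the sketch runs without NLTK installed.
STOP_WORDS = frozenset([
    "ourselves", "out", "over", "own", "same", "she", "what", "the",
    "is", "this", "too", "who", "whom", "yours", "yourself", "yourselves",
])


def make_filter(stop_words):
    """Build a filtering function that closes over a plain set of stop words."""
    stop_words = frozenset(stop_words)  # built once, not once per word

    def keep_non_stop_words(words):
        return [w for w in words if w not in stop_words]

    return keep_non_stop_words


stopwords_delete = make_filter(STOP_WORDS)
print(stopwords_delete(["the", "world", "is", "too", "small"]))
# prints ['world', 'small']
```

In the real job the set would come from set(stopwords.words('english')) evaluated once on the driver, so the function passed to Spark only carries plain strings with it.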

Abhishek Gupta answered Sep 28 '22 08:09


You are using map over an RDD that has only one row, with each word as a column. So the entire row, which is of type Row, is passed to the stopwords_delete function, and the for loop inside it ends up trying to match that row against the stop words, which fails. Try it like this:

filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)

I got this output for filtered_words:

["shan't", "she'd", 'fuck', 'world', "who's"]

Also, include a return in your function.
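To make the two points above concrete, here is a small plain-Python sketch, assuming a single row whose columns are the words. chain.from_iterable only stands in for flatMap(lambda x: x).collect(), and STOP_WORDS is a tiny made-up substitute for stopwords.words('english').

```python
from itertools import chain

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose.
STOP_WORDS = frozenset(["out", "over", "she", "what", "the", "is", "this", "too", "who"])


def stopwords_delete(word_list, stop_words=STOP_WORDS):
    filtered_words = []
    for word in word_list:
        if word not in stop_words:
            filtered_words.append(word)
    # Without this return the function hands back None
    return filtered_words


# One row whose columns are the words, like the single-row RDD in the question.
rows = [("out", "she'd", "what", "the", "world")]

# Plain-Python analogue of wordlist.flatMap(lambda x: x).collect():
# the row is flattened into individual words before filtering.
words = list(chain.from_iterable(rows))

print(stopwords_delete(words))
# prints ["she'd", 'world']
```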

Another way: you could use a list comprehension to replace the stopwords_delete function:

filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()

Suresh answered Sep 28 '22 07:09