I'm trying to write a function to be executed in several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF.IDF computation.
After reading IPython parallel documentation and some tutorials, it seems to be quite straightforward to do, and I came up with the following:
import pandas as pd
from IPython.parallel import Client
def calculemus(corpus):
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
return vectorizer.fit_transform(corpus)
review = pd.read_csv('review.csv')['text']
review = review.fillna('')
client = Client()
r = client[-1].apply(calculemus, review).get()
BUT I got this error instead:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
154 sa.data = m.bytes
155
--> 156 args = uncanSequence(map(unserialize, sargs), g)
157 kwargs = {}
158 for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
175
176 def unserialize(serialized):
--> 177 return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
159 buf = self.serialized.getData()
160 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161 result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
162 else:
163 raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer
I'm not sure what the problem is, could someone enlighten me on this?
Apparently the error says exactly what it says. If I do this:
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
it kinda works.
So the next question is, is this a feature or a limitation of IPython?
This is a bug in IPython 0.13 that should be fixed in master. There is a special case for serializing numpy arrays that avoids copying data, and this behavior is triggered by an isinstance(numpy.ndarray) check. This was inappropriate, because isinstance catches subclasses, which includes pandas objects, but those pandas objects (and array subclasses in general) should not be treated in the same way, as metadata will be lost, and reconstruction on the other side will often fail.
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()
is equivalent to
r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With