IPython.parallel ValueError: cannot create an OBJECT array from memory buffer

Question

I'm trying to write a function to be executed on several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF-IDF computation.

After reading the IPython.parallel documentation and some tutorials, this seemed quite straightforward, and I came up with the following:

import pandas as pd
from IPython.parallel import Client


def calculemus(corpus):
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')

    return vectorizer.fit_transform(corpus)


review = pd.read_csv('review.csv')['text']
review = review.fillna('')

client = Client()

r = client[-1].apply(calculemus, review).get()

But I got this error instead:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
    154                     sa.data = m.bytes
    155 
--> 156     args = uncanSequence(map(unserialize, sargs), g)
    157     kwargs = {}
    158     for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
    175 
    176 def unserialize(serialized):
--> 177     return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
    159                 buf = self.serialized.getData()
    160                 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161                     result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
    162                 else:
    163                     raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer

I'm not sure what the problem is. Could someone enlighten me?

UPDATE

Apparently the error means exactly what it says. If I convert the Series to a NumPy string array first (with numpy imported as np):

r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()

it more or less works.

So the next question is, is this a feature or a limitation of IPython?

minrk · Accepted Answer

This is a bug in IPython 0.13 that should be fixed in master. There is a special case for serializing numpy arrays that avoids copying data, and it is triggered by an isinstance check against numpy.ndarray. That check was inappropriate: isinstance also matches subclasses, which includes pandas objects. Pandas objects (and array subclasses in general) should not be treated the same way as plain arrays, because their metadata is lost and reconstruction on the other side will often fail.
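
The two ingredients of the failure can be reproduced with NumPy alone. What follows is a minimal sketch, not IPython's actual serialization code: TaggedArray is a hypothetical ndarray subclass standing in for a pandas Series, and the sample texts are made up. It shows that an isinstance check accepts the subclass, that NumPy refuses to rebuild an object-dtype array from raw bytes, and that a fixed-width string array (the workaround from the question) round-trips fine.

```python
import numpy as np


class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass standing in for a pandas Series."""


texts = ["good food", "slow service", ""]

# An object-dtype array stores pointers to Python objects, not the text itself.
obj_arr = np.array(texts, dtype=object).view(TaggedArray)

# isinstance matches subclasses, so an ndarray fast path guarded by it
# would accept this array; an exact type check would not.
subclass_matches = isinstance(obj_arr, np.ndarray)
exact_type_matches = type(obj_arr) is np.ndarray

# NumPy refuses to rebuild an object array from a raw memory buffer.
raw = b"\x00" * (obj_arr.size * obj_arr.itemsize)
try:
    np.frombuffer(raw, dtype=object)
    object_error = None
except ValueError as exc:
    object_error = str(exc)

# A fixed-width string array holds its characters inline,
# so its bytes can be turned back into the same array.
str_arr = np.array(texts, dtype=str)
restored = np.frombuffer(str_arr.tobytes(), dtype=str_arr.dtype)
```

Here object_error holds the message of the ValueError raised by numpy.frombuffer, while restored contains the original strings again. This is why passing np.array(review, dtype=str) sidesteps the problem.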

PS:

r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()

is equivalent to

r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))
