I'm looking to use LIME's explainer within a UDF on PySpark. I've previously trained the tabular explainer and stored it as a dill model, as suggested in this link.
loaded_explainer = dill.load(open('location_to_explainer', 'rb'))

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    exp = loaded_explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]
    return str(mapping)
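For reference, this is roughly how I'm wiring it up as a UDF (a minimal sketch; the DataFrame df and the feature column names are placeholders for my actual data):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap the explain function as a Spark UDF returning the stringified feature map
lime_udf = udf(lime_explainer, StringType())

# Apply it to the feature columns (placeholder names)
df = df.withColumn('lime_map', lime_udf('feature_1', 'feature_2', 'feature_3'))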
This, however, takes a lot of time, as it appears that much of the computation happens on the driver. I've therefore been trying to use Spark's broadcast to ship the explainer to the executors.
broadcasted_explainer = sc.broadcast(loaded_explainer)

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    exp = broadcasted_explainer.value.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]
    return str(mapping)
However, I run into a pickling error on the broadcast:

PicklingError: Can't pickle <function <lambda> at 0x7f69fd5680d0>: attribute lookup <lambda> on lime.discretize failed
Can anybody help with this? Is there something like dill that we can use instead of the cloudpickle serializer used in Spark?
I'm the dill author. I agree with @Majaha, and will extend @Majaha's answer slightly. In the first link in @Majaha's answer, it's clearly pointed out that a Broadcast instance is hardwired to use pickle... so the suggestion to dill to a string, then undill afterward, is a good one.
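Concretely, that workaround would look something like the sketch below (reusing loaded_explainer, loaded_model and np from the question): broadcast the dill-serialized bytes, and undill them inside the UDF so Spark's built-in pickler only ever sees a plain bytes object.

import dill

# Serialize the explainer with dill and broadcast the resulting bytes.
explainer_bytes = dill.dumps(loaded_explainer)
broadcasted_bytes = sc.broadcast(explainer_bytes)

def lime_explainer(*cols):
    # Undill on the executor side.
    explainer = dill.loads(broadcasted_bytes.value)
    selected_cols = np.array([value for value in cols])
    exp = explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    return str(exp.as_map()[1])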
Unfortunately, the extend method probably won't work for you. In the Broadcast class, the source uses cPickle, which dill cannot extend. If you look at the source, it uses import cPickle as pickle; ... pickle.dumps for Python 2, and import pickle; ... pickle.dumps for Python 3. Had it used import pickle; ... pickle.dumps for Python 2, and import pickle; ... pickle._dumps for Python 3, then dill could extend the pickler just by doing an import dill. For example:
Python 3.6.6 (default, Jun 28 2018, 05:53:46)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pickle import _dumps
>>> import dill
>>> _dumps(lambda x:x)
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x01KCC\x04|\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x07\x00\x00\x00<stdin>q\tX\x08\x00\x00\x00<lambda>q\nK\x01C\x00q\x0b))tq\x0cRq\rc__main__\n__dict__\nh\nNN}q\x0etq\x0fRq\x10.'
You could, thus, either do what @Majaha suggests (and bookend the call to broadcast), or you could patch the code to make the replacement that I outline above (where needed, but eh...), or you may be able to make your own derived class that does the job using dill:
>>> from pyspark.broadcast import Broadcast as _Broadcast
>>>
>>> class Broadcast(_Broadcast):
... def dump(self, value, f):
... try:
... import dill
... dill.dump(value, f, pickle_protocol)
... ...[INSERT THE REST OF THE DUMP METHOD HERE]...
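For completeness, a filled-in version might look roughly like the following. It mirrors the structure of the dump method in the pyspark.broadcast source, but swaps the standard library in for pyspark's internal helpers (pickle_protocol, print_exec), so treat it as a sketch rather than a drop-in copy; you'd also still need to arrange for SparkContext to construct this class instead of the stock Broadcast.

import pickle
import sys

import dill
from pyspark.broadcast import Broadcast as _Broadcast

class Broadcast(_Broadcast):
    def dump(self, value, f):
        # Same shape as pyspark's Broadcast.dump, but with dill doing the serialization.
        try:
            dill.dump(value, f, pickle.HIGHEST_PROTOCOL)
        except pickle.PickleError:
            raise
        except Exception as e:
            msg = "Could not serialize broadcast: %s: %s" % (e.__class__.__name__, str(e))
            print(msg, file=sys.stderr)
            raise pickle.PicklingError(msg)
        f.close()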
If the above fails... you could still get it to work by pinpointing where the serialization failure occurs (there's dill.detect.trace to help you with that).
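For example, something along these lines prints what dill touches while serializing, which usually points straight at the offending object (the trace API has moved around a bit between dill versions, so adjust to yours):

import dill
import dill.detect

# Turn on serialization tracing, then attempt to serialize the explainer;
# the printed trace shows which nested object trips the failure.
dill.detect.trace(True)
dill.dumps(loaded_explainer)

# badobjects can also list the unserializable pieces directly.
print(dill.detect.badobjects(loaded_explainer, depth=1))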
If you are going to suggest to pyspark to use dill... a potentially better suggestion is to allow users to dynamically replace the serializer. This is what mpi4py and a few other packages do.
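For a sense of what that looks like, mpi4py lets you swap in dill at runtime, roughly like this (the exact call depends on the mpi4py version):

import dill
from mpi4py import MPI

# Replace mpi4py's default pickle-based serializer with dill at runtime.
MPI.pickle.__init__(dill.dumps, dill.loads)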