 

Using Python LIME as a UDF on Spark

I'm looking to use LIME's explainer within a UDF on PySpark. I've previously trained the tabular explainer and stored it as a dill model as suggested in link

import dill
import numpy as np

# the explainer was trained earlier and dumped to disk with dill
loaded_explainer = dill.load(open('location_to_explainer', 'rb'))

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    exp = loaded_explainer.explain_instance(selected_cols, loaded_model.predict_proba,
                                            num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)

This, however, takes a lot of time, as it appears that much of the computation happens on the driver. I've since been trying to use Spark broadcast to ship the explainer to the executors.

broadcasted_explainer = sc.broadcast(loaded_explainer)

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    # pull the explainer out of the broadcast variable on the executor
    exp = broadcasted_explainer.value.explain_instance(selected_cols, loaded_model.predict_proba,
                                                       num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)

However, I run into a pickling error on broadcast.

PicklingError: Can't pickle <function <lambda> at 0x7f69fd5680d0>: attribute lookup <lambda> on lime.discretize failed

Can anybody help with this? Is there something like dill that we can use instead of the cloudpickle serializer used in Spark?

asked Mar 26 '19 by ArunK


1 Answer

I'm the dill author. I agree with @Majaha, and will extend @Majaha's answer slightly. In the first link in @Majaha's answer, it's clearly pointed out that a Broadcast instance is hardwired to use pickle... so the suggestion to dill to a string, then undill afterward is a good one.
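
For concreteness, here is a minimal sketch of that suggestion (my addition, not part of the original answer), reusing loaded_explainer, loaded_model, np and sc from the question:

import dill

# dill the explainer to a plain bytes string; pickle has no trouble broadcasting bytes
explainer_bytes = dill.dumps(loaded_explainer)
broadcasted_bytes = sc.broadcast(explainer_bytes)

def lime_explainer(*cols):
    # undill on the executor side (caching the result would avoid repeated deserialization)
    explainer = dill.loads(broadcasted_bytes.value)
    exp = explainer.explain_instance(np.array(cols), loaded_model.predict_proba,
                                     num_features=10)
    return str(exp.as_map()[1])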

Unfortunately, the extend method probably won't work for you. In the Broadcast class, the source uses cPickle, which dill cannot extend. If you look at the source, it uses import cPickle as pickle; ... pickle.dumps for Python 2, and import pickle; ... pickle.dumps for Python 3. Had it used import pickle; ... pickle.dumps for Python 2, and import pickle; ... pickle._dumps for Python 3, then dill could extend the pickler by just doing an import dill. For example:

Python 3.6.6 (default, Jun 28 2018, 05:53:46) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pickle import _dumps
>>> import dill
>>> _dumps(lambda x:x)
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x01KCC\x04|\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x07\x00\x00\x00<stdin>q\tX\x08\x00\x00\x00<lambda>q\nK\x01C\x00q\x0b))tq\x0cRq\rc__main__\n__dict__\nh\nNN}q\x0etq\x0fRq\x10.'
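
For contrast (my addition, not from the original answer), the C-accelerated entry point would be expected to keep failing on the same lambda even with dill imported, since dill only extends the pure-Python pickler:

>>> import dill                  # extends the pure-Python pickle.Pickler only
>>> import pickle
>>> pickle.dumps(lambda x: x)    # the C-accelerated dumps still raises PicklingError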

You could, thus, either do what @Majaha suggests (and bookend the call to broadcast) or you could patch the code to make the replacement that I outline above (where needed, but eh...), or you may be able to make your own derived class that does the job using dill:

>>> from pyspark.broadcast import Broadcast as _Broadcast
>>>
>>> class Broadcast(_Broadcast):
...   def dump(self, value, f):
...     try:
...       import dill
...       dill.dump(value, f, pickle_protocol)
...     ...[INSERT THE REST OF THE DUMP METHOD HERE]...
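
Filling that in, a rough sketch of what the completed subclass might look like (my guess at the remaining error handling, modelled loosely on pyspark's own dump method rather than copied from it; the class name is just illustrative):

import dill
import pickle
from pyspark.broadcast import Broadcast as _Broadcast

class DillBroadcast(_Broadcast):
    """Hypothetical Broadcast subclass that serializes values with dill."""
    def dump(self, value, f):
        try:
            dill.dump(value, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            raise pickle.PicklingError(
                "Could not serialize broadcast with dill: %s: %s"
                % (e.__class__.__name__, e))
        f.close()

Note that you'd still need to arrange for the SparkContext to construct this class (or instantiate it yourself) instead of the stock Broadcast, which is left out here.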

If the above fails... you could still get it to work by pinpointing where the serialization failure occurs (there's dill.detect.trace to help you with that).
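
For instance (again my sketch, not part of the original answer), something along these lines can narrow down the offending attribute, depending on your dill version:

import dill
import dill.detect

dill.detect.trace(True)          # print what dill visits while pickling
dill.dumps(loaded_explainer)     # the last entries before a failure point at the culprit
dill.detect.baditems(loaded_explainer)   # list members that refuse to pickle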

If you are going to suggest to the pyspark developers that they use dill... a potentially better suggestion is to allow users to dynamically replace the serializer. This is what mpi4py and a few other packages do.

answered Nov 01 '22 by Mike McKerns