I've developed a spam classifier using pandas and scikit learn to the point where it's ready for integration into our hadoop-based system. To this end, I need to export my classifier to a more common format than pickling.
The Predictive Model Markup Language (PMML) is my preferred export format. It plays exceedingly well with Cascading, which we already use. However, I surprisingly cannot find any python libraries that export scikit-learn models into PMML.
Has anyone had experience with this use case? Is there any sort of alternative to PMML that would lend interoperability between scikit-learn and hadoop? What about a solid PMML export library?
You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source but Py2PMML from Zementis seems to be a commercial product. Besides this alternative there are no other tools to score Scikit models exported as PMML on Java/Hadoop. The core scikit team is planning to implement a PMML exporter though. But if you don't want any commercial solutions or wait for such tool to be implemented you still have some options but they require some coding:
export_graphviz
function obtain the DOT representation of each decision tree and write a small Java interpreter.Hope it helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With