
How to fit and score machine learning models in a Java/JVM-based application

Could you please guide me on how to create and execute machine learning/statistical models (regression, decision trees, k-means clustering, naive Bayes, scorecards, linear/logistic regression, GBM, GLM, etc.) in a Java/JVM-based application in production?

We have an ETL-style, Java-based product in which one can do most of the data preparation steps for machine learning: data ingestion from JDBC, files, HDFS, NoSQL stores, etc., plus joins and aggregations (which are required for feature engineering). We now want to add analytics capabilities using machine learning/statistical modeling.

Right now we use the JPMML evaluator to score models that were created in R and Python (and KNIME) and exported in PMML format, but this takes three separate and unconnected steps:

1. Prepare the data in our Java/JVM application and save the sample data (training and test) to a CSV file or a database.
2. Create a machine learning model in R or Python (or KNIME) and export it in PMML 4.2 format.
3. Import/deploy the PMML file in our Java-based application and use the JPMML evaluator to execute it in production.

I am sure this is a common problem in machine learning, since Java is generally preferred over Python or R in production. Could you suggest a better approach to both create and execute a Python/scikit-learn based machine learning model in a JVM-based application?

What are your thoughts on achieving steps 2 and 3 more seamlessly in a JVM-based application, without compromising performance and usability? The options I see:

1. Call a Java program that internally calls the Python scikit-learn script (under the hood) to create a model in PMML, then use the JPMML evaluator. To the user it will look like a single JVM-based application (better usability). I am not sure what the limitations and shortcomings of PMML are, as not all features are supported in jpmml-sklearn.
2. Call a Java program that internally calls the Python script, does both model creation and execution in an external Python environment, and serializes the model and the results to a file/CSV or an in-memory DB (or a cache such as Hazelcast), from where the parent Java application fetches the results. From my research, I can't use Jython to execute scikit-learn models.
3. Use Jep (Embed Python in Java) to embed CPython in the JVM. Has anybody tried it for scikit-learn models?
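Option 2 above (a Java program launching an external Python process and collecting its results) could be sketched roughly as follows. This is only a minimal illustration using the standard `ProcessBuilder` API. The class name `ExternalScorer`, the script name, the `--input` flag, and the "one score per stdout line" output format are all assumptions made for the example, not something the question specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: runs an external scoring script and reads one score per stdout line. */
final class ExternalScorer {

    /** Builds the command line. The "--input" flag is an assumed convention of the hypothetical script. */
    static List<String> buildCommand(String interpreter, String script, String inputCsv) {
        return List.of(interpreter, script, "--input", inputCsv);
    }

    /** Starts the process, collects its stdout lines, and fails on a non-zero exit code. */
    static List<String> run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the child's stderr (warnings, tracebacks) to the parent's stderr,
        // so that stdout carries only the scores.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Scoring script exited with code " + exitCode);
        }
        return lines;
    }

    /** Parses one numeric score per line, skipping blank lines. */
    static List<Double> parseScores(List<String> lines) {
        List<Double> scores = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                scores.add(Double.parseDouble(trimmed));
            }
        }
        return scores;
    }
}
```

A call might look like `ExternalScorer.parseScores(ExternalScorer.run(ExternalScorer.buildCommand("python3", "score.py", "test.csv")))`, where `score.py` and `test.csv` are placeholder names. Exchanging results through a file or a cache instead of stdout would follow the same launch-and-wait pattern.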

Alternatively, should I explore Mahout or Weka, the Java-based machine learning libraries, for my JVM-based application? (I need to support both Windows and non-Windows platforms.)

I am also exploring H2O.ai, which is Java-based. Has anybody tried it?

Gaurav Gupta asked Nov 08 '22 13:11


1 Answer

I use IntelliJ IDEA with the Python plugin. This way I have both Java and Python code in one and the same project. The data is in the database; the connection is always visible and accessible, regardless of whether I have a .java or a .py file open in the editor. The list of run configurations can hold Python scripts, Java applications, Maven goals, etc. Therefore I don't think you have to mix Python and Java code together (by calling Python scripts from Java). That is completely unnecessary.

My workflow is (everything in IntelliJ IDEA):

1. Prepare the data (usually SQL).
2. Run a Python script that applies a pipeline of transformers to the pandas DataFrame constructed from a certain database table and outputs a PMML file.
3. Use the scikit-learn model (exported as PMML) in your Java application.

Volokh answered Nov 14 '22 22:11