Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

I'm writing a spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm.

For example, there is one LogisticRegression in org.apache.spark.ml.classification also a LogisticRegressionwithSGD in org.apache.spark.mllib.classification.

The only difference I can find is that the one in org.apache.spark.ml is inherited from Estimator and was able to be used in cross validation. I was quite confused that they are placed in different packages. Is there anyone know the reason for it? Thanks!

like image 953
ailzhang Avatar asked May 14 '15 07:05

ailzhang


People also ask

What is the difference between Spark ml and Spark MLlib?

mllib is the first of the two Spark APIs while org.apache.spark.ml is the new API. spark. mllib carries the original API built on top of RDDs. spark.ml contains higher-level API built on top of DataFrames for constructing ML pipelines.

What is Apache Spark MLlib?

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives. Guides for individual algorithms are listed below.

Is MLlib part of Spark?

Community. MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release.

Why is MLlib switching to the DataFrame based API?

Why is MLlib switching to the DataFrame-based API? DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.


2 Answers

It's JIRA ticket

And From Design Doc:

MLlib now covers a basic selection of machine learning algorithms, e.g., logistic regression, decision trees, alternating least squares, and k-means. The current set of APIs contains several design flaws that prevent us moving forward to address practical machine learning pipelines, make MLlib itself a scalable project.

The new set of APIs will live under org.apache.spark.ml, and o.a.s.mllib will be deprecated once we migrate all features to o.a.s.ml.

like image 136
yjshen Avatar answered Oct 11 '22 11:10

yjshen


The spark mllib guide says:

spark.mllib contains the original API built on top of RDDs.

spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

and

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.

I think the doc explains it very well.

like image 31
JasonWayne Avatar answered Oct 11 '22 11:10

JasonWayne