 

What's the difference between the Spark ML and MLlib packages?

I noticed there are two LinearRegressionModel classes in Spark, one in the ML package (spark.ml) and another in the MLlib package (spark.mllib).

The two are implemented quite differently: for example, the one from MLlib implements Serializable, while the other one does not.

By the way, the same is true of RandomForestModel and Word2Vec.

Why are there two classes? Which is the "right" one? And is there a way to convert one into the other?

asked Aug 08 '16 18:08 by vyakhir



2 Answers

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened for linear regression) and will most likely be removed in the next major release.

So unless your goal is backward compatibility, the "right choice" is o.a.s.ml.
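To make the "Dataset and ML Pipelines" style concrete, here is a minimal sketch of a linear regression fitted through o.a.s.ml. The toy data, the column names (x1, x2, y), the object name and the local master setting are all assumptions made for illustration, not something from the question.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

object MlPipelineSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: no cluster needed).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ml-pipeline-sketch")
      .getOrCreate()

    // Toy DataFrame with hypothetical column names; y = 2 * x1 + x2.
    val training = spark.createDataFrame(Seq(
      (1.0, 0.0, 2.0),
      (2.0, 1.0, 5.0),
      (3.0, 1.0, 7.0),
      (4.0, 3.0, 11.0)
    )).toDF("x1", "x2", "y")

    // Feature transformation stage: pack the raw columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")

    // Estimator stage from the DataFrame-based API.
    val lr = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("y")
      .setMaxIter(10)

    // Fitting the pipeline runs both stages and returns a PipelineModel.
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

    // The last fitted stage is the spark.ml LinearRegressionModel.
    val lrModel = model.stages.last.asInstanceOf[LinearRegressionModel]
    println(s"coefficients=${lrModel.coefficients} intercept=${lrModel.intercept}")

    // The same PipelineModel applies the feature step and the prediction step.
    model.transform(training).select("x1", "x2", "y", "prediction").show()

    spark.stop()
  }
}
```

The point of the sketch is that both the feature step and the model live in one Pipeline and operate on named DataFrame columns, which is the part the RDD-based API does not offer.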

answered Sep 29 '22 18:09 by zero323


Spark MLlib

spark.mllib contains the legacy API built on top of RDDs.

Spark ML

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
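On the "convert one into another" part of the question: the sketch below only bridges the data types of the two packages (vectors and vector columns), using the asML, Vectors.fromML and MLUtils.convertVectorColumnsToML helpers available since Spark 2.0. It does not convert a trained model. The sample values, column names and object name are assumptions for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object VectorBridgeSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: no cluster needed).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("vector-bridge-sketch")
      .getOrCreate()

    // A vector from the legacy spark.mllib package.
    val oldVec: OldVector = OldVectors.dense(1.0, 2.0, 3.0)

    // spark.mllib -> spark.ml
    val newVec: NewVector = oldVec.asML

    // spark.ml -> spark.mllib
    val roundTrip: OldVector = OldVectors.fromML(newVec)
    println(s"round trip preserved values: ${roundTrip == oldVec}")

    // A DataFrame whose "features" column holds legacy spark.mllib vectors
    // (hypothetical column names).
    val oldDf = spark.createDataFrame(Seq(
      (0.0, OldVectors.dense(1.0, 2.0)),
      (1.0, OldVectors.dense(3.0, 4.0))
    )).toDF("label", "features")

    // Rewrite the named vector column to the spark.ml vector type so that
    // spark.ml estimators can consume it.
    val newDf = MLUtils.convertVectorColumnsToML(oldDf, "features")
    newDf.printSchema()

    spark.stop()
  }
}
```

This kind of bridge is mostly useful while migrating: existing RDD-based code can keep producing spark.mllib vectors, and the column is converted once before it is handed to a spark.ml estimator.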

According to the official announcement:

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Apache Spark recommends using spark.ml.

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.

  • MLlib will not add new features to the RDD-based API.

  • In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.

  • After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.

  • The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

  • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

  • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

  • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

For more info: Machine Learning Library (MLlib) Guide

answered Sep 29 '22 18:09 by vaquar khan