I noticed there are two <code>LinearRegressionModel</code> classes in SparkML, one in the ML package (<code>spark.ml</code>) and another in the <code>MLLib</code> package (<code>spark.mllib</code>). The two are implemented quite differently: for example, the one from <code>MLLib</code> implements <code>Serializable</code>, while the other does not. The same is true of <code>RandomForestModel</code> and <code>Word2Vec</code>. Why are there two classes? Which is the "right" one? And is there a way to convert one into the other?


What's the difference between Spark ML and MLLIB packages

2 Answers

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened for linear regression) and will most likely be removed in the next major release.

So unless your goal is backward compatibility, the "right choice" is o.a.s.ml.

159

answered Sep 29 '22 18:09

zero323

Spark Mllib

spark.mllib contains the legacy API built on top of RDDs.

Spark ML

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

According to the official announcement:

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Apache Spark recommends using spark.ml.

MLlib will still support the RDD-based API in spark.mllib with bug fixes.

MLlib will not add new features to the RDD-based API.

In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.

After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.

The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

For more info: Machine Learning Library (MLlib) Guide

answered Sep 29 '22 18:09

vaquar khan


Tags:

apache-spark

apache-spark-ml

apache-spark-mllib

vyakhir
