 

`pyspark mllib` versus `pyspark ml` packages

What is the difference between the pyspark mllib and pyspark ml packages?

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

pyspark ml appears to target algorithms at the DataFrame level, whereas pyspark mllib does not appear to work with DataFrames.

One difference I found is pyspark ml implements pyspark.ml.tuning.CrossValidator while pyspark mllib does not.
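For reference, here is a minimal, hedged sketch of what that cross-validation API looks like in pyspark.ml; the toy data and parameter grid are placeholders, not anything from the question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with the ("features", "label") columns pyspark.ml expects
rows = [(Vectors.dense([float(i), 1.0]), float(i % 2)) for i in range(12)]
train = spark.createDataFrame(rows, ["features", "label"])

lr = LogisticRegression(maxIter=10)

# Grid of regularization values to search over
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv_model = cv.fit(train)  # returns the best model found across the folds
```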

My understanding was that mllib is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?

There does not appear to be interoperability between the two frameworks without transforming types, as they each use a different package structure.
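To illustrate that type transformation, here is a hedged sketch of moving data from the RDD-based representation to the DataFrame-based one; the column names are placeholders, and the MLUtils conversion is only needed because the two packages use different vector classes:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD of LabeledPoint, the structure pyspark.mllib algorithms consume
rdd = sc.parallelize([
    LabeledPoint(0.0, MLlibVectors.dense([0.0, 1.0])),
    LabeledPoint(1.0, MLlibVectors.dense([2.0, 0.5])),
])

# Convert to a DataFrame with "label" and "features" columns for pyspark.ml
df = spark.createDataFrame(rdd.map(lambda lp: (lp.label, lp.features)),
                           ["label", "features"])

# The features column still holds pyspark.mllib.linalg vectors;
# convert it to the pyspark.ml.linalg vector type expected by pyspark.ml
df = MLUtils.convertVectorColumnsToML(df, "features")
```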

asked Apr 05 '17 by blue-sky

People also ask

How is MLlib library and ML library different in spark?

Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular algorithms and utilities. MLlib in Spark is a scalable machine learning library that provides both high-quality algorithms and high speed.

What is PySpark ML?

spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
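As a hedged illustration of that pipeline idea, the sketch below chains a tokenizer, a feature hasher, and a classifier; the stage and column names are arbitrary placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame([
    ("spark is great", 1.0),
    ("hadoop map reduce", 0.0),
], ["text", "label"])

# Each stage reads the columns produced by the previous one
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(train)           # fits all stages in order
predictions = model.transform(train)  # adds a "prediction" column
```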

Is spark MLlib deprecated?

Is MLlib deprecated? No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode.

Is PySpark good for machine learning?

Machine learning in PySpark is easy to use and scalable, and it works on distributed systems. You can use Spark machine learning for data analysis. Thanks to PySpark MLlib, you can apply various machine learning techniques such as regression and classification.




1 Answer

From my experience, pyspark.mllib classes can only be used with pyspark.RDDs, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrames. The documentation supports this; the first entry for the pyspark.ml package states:

DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
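To make that split concrete, here is a minimal sketch (toy data, arbitrary parameters) of training logistic regression each way, one on an RDD with pyspark.mllib and one on a DataFrame with pyspark.ml:

```python
from pyspark.sql import SparkSession

# RDD-based API: pyspark.mllib works on RDDs of LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.regression import LabeledPoint

# DataFrame-based API: pyspark.ml works on DataFrames with a features column
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors as MLVectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# pyspark.mllib: train on an RDD
rdd = sc.parallelize([
    LabeledPoint(0.0, MLlibVectors.dense([0.0, 1.0])),
    LabeledPoint(1.0, MLlibVectors.dense([2.0, 0.5])),
])
mllib_model = LogisticRegressionWithLBFGS.train(rdd)

# pyspark.ml: train on a DataFrame
df = spark.createDataFrame([
    (MLVectors.dense([0.0, 1.0]), 0.0),
    (MLVectors.dense([2.0, 0.5]), 1.0),
], ["features", "label"])
ml_model = LogisticRegression(maxIter=10).fit(df)
```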

Now I am reminded of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits/drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.

The gist was that there are situations in which each is highly suited and others where it is not. One example I remember is that if your data is already structured, DataFrames confer some performance benefits over RDDs, and this apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summary, the author concluded that RDDs are great for low-level operations, but for high-level operations, viewing data, and tying in with other APIs, DataFrames and Datasets are superior.

So, to come back full circle to your question, I believe the answer is a resounding pyspark.ml, as the classes in this package are designed to use pyspark.sql.DataFrames. I would imagine that the performance difference between complex algorithms implemented in each of these packages would be significant if you were to test them against the same data structured as a DataFrame versus an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and perform better.

answered Sep 30 '22 by Grr