What is the difference between the `pyspark.mllib` and `pyspark.ml` packages?
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
`pyspark.ml` appears to target algorithms at the DataFrame level.
One difference I found is that `pyspark.ml` implements `pyspark.ml.tuning.CrossValidator` while `pyspark.mllib` does not.
My understanding was that `mllib` is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?
There does not appear to be interoperability between the two frameworks without transforming types, as each uses a different package structure.
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular algorithms and utilities. MLlib is Spark's scalable machine learning library, offering both high-quality algorithms and high speed.
spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
Is MLlib deprecated? No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode.
Machine learning in PySpark is easy to use and scalable, and it works on distributed systems. You can use Spark machine learning for data analysis, applying techniques such as regression and classification through the PySpark MLlib algorithms.
From my experience, `pyspark.mllib` classes can only be used with `pyspark.RDD`s, whereas (as you mention) `pyspark.ml` classes can only be used with `pyspark.sql.DataFrame`s. There is mention of this in the documentation for `pyspark.ml`; the first entry in the `pyspark.ml` package states:
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
Now I am reminded of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits/drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.
The gist was that there are situations in which each is highly suited and others where it might not be. One example I remember: if your data is already structured, DataFrames confer some performance benefits over RDDs, and the advantage apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summation, the author concluded that RDDs are great for low-level operations, but for high-level operations, viewing, and tying together with other APIs, DataFrames and Datasets are superior.
So to come back full circle to your question, I believe the answer is a resounding `pyspark.ml`, as the classes in this package are designed to utilize `pyspark.sql.DataFrame`s. I would imagine that the performance difference between complex algorithms implemented in each of these packages would be significant if you were to test against the same data structured as a DataFrame vs. an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and perform better.