What is the difference between the `pyspark.mllib` and `pyspark.ml` packages?
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
`pyspark.ml` appears to target algorithms at the DataFrame level.
One difference I found is that `pyspark.ml` implements `pyspark.ml.tuning.CrossValidator` while `pyspark.mllib` does not.
My understanding was that `mllib` is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?
There does not appear to be interoperability between the two frameworks without transforming types, as each uses a different package structure.
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular algorithms and utilities. MLlib is Spark's scalable machine learning library, offering both high-quality algorithms and high speed.
spark.ml is a new package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
Is MLlib deprecated? No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode.
Machine learning in PySpark is easy to use and scalable, and it works on distributed systems. You can use Spark machine learning for data analysis, applying techniques such as regression and classification through the PySpark MLlib algorithms.
From my experience, `pyspark.mllib` classes can only be used with `pyspark.RDD`s, whereas (as you mention) `pyspark.ml` classes can only be used with `pyspark.sql.DataFrame`s. There is mention of this in the documentation for `pyspark.ml`; the first entry in the `pyspark.ml` package states:
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
Now I am reminded of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits/drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.
The gist was that there are situations in which each is highly suited and others where it might not be. One example I remember: if your data is already structured, DataFrames confer some performance benefits over RDDs, and the advantage apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summation, the author concluded that RDDs are great for low-level operations, but for high-level operations, viewing, and tying together with other APIs, DataFrames and Datasets are superior.
So to come back full circle to your question, I believe the answer is a resounding `pyspark.ml`, as the classes in this package are designed to utilize `pyspark.sql.DataFrame`s. I would imagine that the performance difference between complex algorithms implemented in each of these packages would be significant if you were to test against the same data structured as a DataFrame vs. an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and perform better.