What is the difference between Apache Mahout and Apache Spark's MLlib?

1 Answer

The main difference comes from the underlying frameworks: Mahout runs on Hadoop MapReduce, while MLlib runs on Spark. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the only real difference is startup overhead, which is dozens of seconds for Hadoop MR and, let's say, 1 second for Spark. For model training, that is not very important.
Things are different if your algorithm maps to many jobs. In that case the same overhead difference applies to every iteration, and it can be a game changer.
Let's assume we need 100 iterations, each needing 5 seconds of cluster CPU, with roughly 30 seconds of startup overhead per Hadoop MR job and 1 second per Spark job.

On Spark it will take 100*5 + 100*1 = 600 seconds.
On Hadoop MR (Mahout) it will take 100*5 + 100*30 = 3500 seconds.
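
The estimate above is a simple cost model: every iteration pays its compute time plus one job-startup overhead. A minimal Python sketch of that arithmetic follows. The function name `total_seconds` and the constants are only illustrative, and the overhead figures are the rough assumptions from this answer, not measurements.

```python
def total_seconds(iterations: int, compute_s: float, overhead_s: float) -> float:
    """Estimate total run time when each iteration is launched as its own job.

    Every iteration pays its compute time plus the framework's per-job
    startup overhead.
    """
    return iterations * (compute_s + overhead_s)


# Figures taken from the example in this answer (assumed, not measured).
ITERATIONS = 100
COMPUTE_S = 5              # cluster CPU seconds per iteration
SPARK_OVERHEAD_S = 1       # assumed per-job overhead on Spark
MAPREDUCE_OVERHEAD_S = 30  # assumed per-job overhead on Hadoop MR

spark_total = total_seconds(ITERATIONS, COMPUTE_S, SPARK_OVERHEAD_S)
mapreduce_total = total_seconds(ITERATIONS, COMPUTE_S, MAPREDUCE_OVERHEAD_S)

print(f"Spark: {spark_total} s")
print(f"Hadoop MR (Mahout): {mapreduce_total} s")
print(f"Ratio: {mapreduce_total / spark_total:.1f}x")
```

With a single iteration the same function gives 6 seconds versus 35 seconds, which is why the overhead matters little for a one-job algorithm but dominates an iterative one.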

At the same time, Hadoop MR is a much more mature framework than Spark, so if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.

answered Sep 23 '22 17:09

David Gruzman

Tags:

apache-spark

mahout

apache-spark-mllib

eliasah
