What is the difference between Apache Mahout and Apache Spark's MLlib?

Consider a MySQL products database with 10 million products for an e-commerce website.

I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.

I wanted to use Mahout on top of it as a machine learning framework, to use one of its classification algorithms, and then I ran into Spark, which comes with MLlib.

  • So what is the difference between the two frameworks?
  • Mainly, what are the advantages, drawbacks, and limitations of each?
eliasah asked May 07 '14 07:05



1 Answer

The main difference comes from the underlying frameworks: Mahout runs on Hadoop MapReduce, while MLlib runs on Spark. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference is only the startup overhead, which is dozens of seconds for Hadoop MR and, let's say, 1 second for Spark. So for model training it is not that important.
Things are different if your algorithm maps to many jobs. In that case the same overhead difference applies to every iteration, and it can be a game changer.
Let's assume that we need 100 iterations, each needing 5 seconds of cluster CPU.

  • On Spark: it will take 100*5 + 100*1 = 600 seconds.
  • On Hadoop MR (Mahout): it will take 100*5 + 100*30 = 3500 seconds.
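
The arithmetic above can be written as a small Python sketch. It assumes every iteration is launched as a separate job and pays the full startup overhead each time; the function name `total_seconds` is made up for illustration, and the 1-second and 30-second overheads are the rough guesses used in this answer, not measurements.

```python
def total_seconds(iterations, compute_per_iteration, overhead_per_job):
    """Total time when each iteration runs as its own job.

    Simplified model: every job pays the same fixed startup overhead
    on top of its compute time.
    """
    return iterations * (compute_per_iteration + overhead_per_job)


# Figures taken from the answer; the overheads are illustrative guesses.
ITERATIONS = 100
COMPUTE_SECONDS = 5
SPARK_OVERHEAD = 1        # "let's say 1 second" per Spark job
MAPREDUCE_OVERHEAD = 30   # "dozens of seconds" per Hadoop MR job

spark_total = total_seconds(ITERATIONS, COMPUTE_SECONDS, SPARK_OVERHEAD)
mapreduce_total = total_seconds(ITERATIONS, COMPUTE_SECONDS, MAPREDUCE_OVERHEAD)

print(f"Spark:     {spark_total} s")      # 600 s
print(f"Hadoop MR: {mapreduce_total} s")  # 3500 s
```

With `iterations=1` the same function gives 6 versus 35 seconds, which is the single-job case where the gap is only the startup overhead.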

At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.

David Gruzman answered Sep 23 '22 17:09