Consider a MySQL products database with 10 million products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework, using one of its classification algorithms, and then I ran into Spark, which comes with MLlib.
Spark with MLlib has been reported to be about nine times faster than Apache Mahout in a disk-based Hadoop environment.
spark.mllib is the first of the two Spark ML APIs, while spark.ml (package org.apache.spark.ml) is the newer one. spark.mllib carries the original API built on top of RDDs. spark.ml contains a higher-level API built on top of DataFrames for constructing ML pipelines.
MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms. Mahout implements popular machine learning techniques such as recommendation, classification, and clustering.
The main difference will come from the underlying frameworks. In the case of Mahout it is Hadoop MapReduce, and in the case of MLlib it is Spark. To be more specific, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference will be only the startup overhead, which is dozens of seconds for Hadoop MR and, let's say, 1 second for Spark. So for model training it is not that important.
Things will be different if your algorithm maps to many jobs. In this case we pay the same overhead difference on every iteration, and that can be a game changer.
Let's assume that we need 100 iterations, each needing 5 seconds of cluster CPU.
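To make that scenario concrete, here is a rough back-of-the-envelope sketch. It assumes every iteration runs as a separate job and pays the full per-job overhead. The 1 second for Spark comes from the figure above; the 30 seconds for Hadoop MR is only an assumed stand-in for "dozens of seconds", and the function name `total_runtime` is made up for this illustration.

```python
def total_runtime(iterations, compute_per_iter_s, overhead_per_job_s):
    """Total seconds when each iteration is launched as its own job."""
    return iterations * (compute_per_iter_s + overhead_per_job_s)


ITERATIONS = 100
COMPUTE_S = 5            # cluster CPU seconds per iteration, from the text
SPARK_OVERHEAD_S = 1     # "let's say 1 second" per job, from the text
HADOOP_OVERHEAD_S = 30   # ASSUMPTION: one possible value for "dozens of seconds"

spark_total = total_runtime(ITERATIONS, COMPUTE_S, SPARK_OVERHEAD_S)
hadoop_total = total_runtime(ITERATIONS, COMPUTE_S, HADOOP_OVERHEAD_S)

print(f"Spark:     {spark_total} s")   # 100 * (5 + 1)  = 600 s
print(f"Hadoop MR: {hadoop_total} s")  # 100 * (5 + 30) = 3500 s
```

Under these assumed numbers the useful compute is 500 seconds in both cases; the whole gap comes from the per-job overhead being paid 100 times.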
At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.