Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:

- spark.mllib, built on top of RDDs.
- spark.ml, built on top of DataFrames.

According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.), and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.
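For instance, here is a minimal sketch of Frequent Pattern Mining as it has to be written today, through the RDD-based spark.mllib API (assuming an existing SparkContext named sc; the transactions are made-up toy data):

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// spark.mllib's FPGrowth only accepts an RDD of item arrays,
// not a DataFrame.
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("butter", "jam")
))

val model = new FPGrowth()
  .setMinSupport(0.5)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")}: ${itemset.freq}")
}
```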
If DataFrames are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?
spark.mllib is the first of the two Spark machine learning APIs, while org.apache.spark.ml is the new one. spark.mllib carries the original API built on top of RDDs; spark.ml contains a higher-level API built on top of DataFrames for constructing ML pipelines.
Is MLlib deprecated? No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode.
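To see the difference in style, here is a minimal sketch of the DataFrame-based API (assuming Spark 2.x and an existing SparkSession named spark; the two-row dataset is made-up toy data). Estimators such as NaiveBayes consume a DataFrame with label and features columns and return a fitted model:

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors

// DataFrame-based API: the input is a DataFrame, not an RDD.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0))
)).toDF("label", "features")

val model = new NaiveBayes().fit(training)   // Estimator -> Model
model.transform(training).show()             // adds prediction columns
```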
Spark 2.0.0
Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames. So in practice it is more a matter of having an ml wrapper than anything else. Even a native ML implementation (like ml.recommendation.ALS) uses RDDs internally.
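As a hedged sketch of that wrapper pattern (the method name fitViaMllib is hypothetical, but dropping from the incoming DataFrame down to an RDD is roughly what such wrappers do internally):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper: accept a DataFrame at the API surface,
// immediately convert to an RDD, and delegate to an RDD-based
// spark.mllib implementation.
def fitViaMllib(dataset: DataFrame): Unit = {
  val features = dataset
    .select("features")
    .rdd                                   // leave the DataFrame world here
    .map(_.getAs[Vector]("features"))
  // ... hand `features` to the RDD-based algorithm ...
}
```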
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.
Moreover, not every type maps naturally onto a DataFrame / Dataset; ML vector types, for example, are only supported through VectorUDT. There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider this a problem or not, but one thing is for sure: it doesn't feel natural in the Scala world.
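For example, a minimal sketch of what is given up (assuming Spark 2.x, a SparkSession named spark in scope, and a toy Point case class):

```scala
import spark.implicits._

case class Point(x: Double, y: Double)

val ds = Seq(Point(1.0, 2.0)).toDS()   // Dataset[Point]: statically typed
ds.map(p => p.x + p.y)                 // the compiler knows p is a Point

val df = ds.toDF()                     // DataFrame = Dataset[Row]: untyped
df.select("x", "z")                    // the typo "z" fails only at runtime
```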
Regarding "better, newer and faster", I would take a look at Deep Dive into Spark SQL's Catalyst Optimizer, in particular the part related to quasiquotes:

"The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs."
* This has changed in Spark 1.6, but it is still limited to the default HashPartitioning.