
Difference between spark with h2o and sparkling water

I have a few questions about Sparkling Water and why it is needed.

Let's assume I have a trained H2O model exported in both binary and POJO form.

Now I want to deploy the model to production, and I can use either the POJO (on plain Spark) or the binary model (with Sparkling Water).

  1. Which one should I use: plain Spark with the POJO, or Sparkling Water with the binary model?
  2. What exactly is Sparkling Water for, when a model can easily be deployed using a POJO and Spark alone?
  3. Is Sparkling Water needed only when you have to train a model on huge amounts of data, or can it also be used for production deployments of models?

Example: https://github.com/h2oai/h2o-droplets/blob/master/h2o-pojo-on-spark-droplet/src/main/scala/examples/PojoExample.scala

Uses spark to run a pojo model.

Example: https://github.com/h2oai/h2o-droplets/blob/master/sparkling-water-droplet/src/main/scala/water/droplets/SparklingWaterDroplet.scala

Trains / Runs a model in sparkling water.

What advantages does Sparkling Water provide over plain Spark?

Lalit Agarwal asked Apr 05 '17 16:04



1 Answer

  1. Which one should I use: plain Spark with the POJO, or Sparkling Water with the binary model?

    • There is no single right answer; it depends on your use case. It sounds like what you want is the POJO/MOJO in Spark, so you can score without the added dependency of a running H2O cluster.
  2. What exactly is Sparkling Water for, when a model can easily be deployed using a POJO and Spark alone?

    • Sparkling Water exists to make H2O available within a Spark context. This is particularly useful for training: you can leverage Spark's many data connectors, its data-munging capabilities, and so on. POJO/MOJO + Spark is sufficient for scoring.
  3. Is Sparkling Water needed only when you have to train a model on huge amounts of data, or can it also be used for production deployments of models?

    • Sparkling Water is needed when you want to use H2O's algorithms in a way that works well with the rest of the Spark ecosystem.

If putting a model in "production" means "always on" scoring exposed as a REST endpoint or similar, the POJO/MOJO is the way to go, because H2O clusters are not highly available. You will need to make sure you handle incoming data correctly yourself, though.
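For the POJO/MOJO route, scoring goes through the h2o-genmodel library rather than an H2O cluster, and each input row is built by hand. Below is a minimal sketch of that pattern inside a Spark job. It assumes a binomial model exported as a MOJO, with h2o-genmodel on the classpath; the file paths, the feature names (AGE, RACE, PSA), and the object name are hypothetical placeholders, not something taken from the question.

```scala
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import org.apache.spark.sql.SparkSession

object MojoScoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mojo-scoring-sketch").getOrCreate()
    import spark.implicits._ // provides the Encoder[String] needed by mapPartitions

    // Hypothetical values: replace with your own model file and columns.
    val mojoPath = "/path/to/model.zip" // must be readable on every executor
    val features = Seq("AGE", "RACE", "PSA")

    // Without schema inference, every CSV column is read as a string,
    // which is the form RowData accepts.
    val input = spark.read.option("header", "true").csv("/path/to/input.csv")

    val labels = input.select(features.map(input(_)): _*).mapPartitions { rows =>
      // Load the model once per partition instead of once per row.
      val model = new EasyPredictModelWrapper(MojoModel.load(mojoPath))
      rows.map { r =>
        val row = new RowData()
        features.foreach { f =>
          val v = r.getAs[String](f)
          if (v != null) row.put(f, v) // columns left out are treated as missing
        }
        // Assumes a binomial model; other model types use other predict* methods.
        model.predictBinomial(row).label
      }
    }

    labels.show()
    spark.stop()
  }
}
```

The same EasyPredictModelWrapper call can sit behind a REST handler; Spark only supplies the rows here. Mapping column names and types into RowData is the "handling incoming data yourself" part.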

If you are doing batch scoring, nightly or otherwise, it may make sense to use the binary model with Sparkling Water, because parsing incoming data becomes trivial (asH2OFrame(..)) and scoring is as easy as calling predict().

Nick Karpov answered Nov 13 '22 08:11