I have a few questions or doubts on sparkling water and why is it needed.
Lets assume that I have a generated h2o model with both binary and pojo.
Now I want to deploy the model into production and have an option for using pojo and binary (sparkling water) both.
Example: https://github.com/h2oai/h2o-droplets/blob/master/h2o-pojo-on-spark-droplet/src/main/scala/examples/PojoExample.scala
Uses spark to run a pojo model.
Example: https://github.com/h2oai/h2o-droplets/blob/master/sparkling-water-droplet/src/main/scala/water/droplets/SparklingWaterDroplet.scala
Trains / Runs a model in sparkling water.
What are the advantages which sparkling water h2o provides over normal spark?
Spark is an elegant and powerful general-purpose, open-source, in-memory platform with tremendous momentum. H2O is an in-memory platform for machine learning that is reshaping how people apply math and predictive analytics to their business problems.
H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.
H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates some of the most difficult data science and machine learning workflows such as feature engineering, model validation, model tuning, model selection, and model deployment.
What is Apache Spark? Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size.
Which one should I use? Direct spark with pojo or sparkling water with Binary.
What is the exact use of sparkling water, when we can easily deploy a model using pojo and spark itself?
Is sparkling water needed only when you have to train model on huge amounts of data? Or it can be used in PROD deployments of model's as well.
If putting a model in "production" means having "always on" scoring exposed as a REST endpoint or similar: the POJO/MOJO is the way you want to go (H2O clusters are not highly available). You'll need to make sure you're handling incoming data correctly yourself though.
If you are doing batch scoring, nightly or otherwise, then it may make sense to use the binary model w/ Sparkling Water because parsing incoming data becomes trivial (asH2OFrame(..)) and scoring is easy as predict()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With