I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
(1) General ETL use case - more specifically, a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
(2) Streaming use case - real-time analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long-running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
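To make this concrete, here's roughly the shape of the two applications I have in mind. The keyspace/table names and the aggregation are made up, I'm using the DataStax spark-cassandra-connector, and I'm assuming the streaming events arrive via Kafka - none of those details should matter much for the deployment question:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// (1) Periodic batch ETL: read the raw events column family, aggregate, write back.
// Launched periodically with something like:
//   spark-submit --class EventsEtlJob --master spark://master-host:7077 events-etl-assembly.jar
object EventsEtlJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("events-etl")
      .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
    val sc = new SparkContext(conf)

    sc.cassandraTable("events_ks", "events")                    // hypothetical keyspace/table
      .map(row => (row.getString("event_type"), 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("events_ks", "events_by_type", SomeColumns("event_type", "count"))

    sc.stop()
  }
}

// (2) Long-running streaming job, started once at boot and left running.
object EventsStreamingJob {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("events-streaming"), Seconds(10))

    val events = KafkaUtils
      .createStream(ssc, "zk-host:2181", "events-consumers", Map("events" -> 1))
      .map(_._2)            // drop the Kafka key, keep the event payload

    events.count().print()  // stand-in for the real per-batch analysis

    ssc.start()
    ssc.awaitTermination()
  }
}
```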
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a JAR and running the various tasks with spark-submit - this seems to be the approach recommended in the Spark docs, though I have some thoughts about this strategy.
Creating a separate webapp as the driver program (a rough sketch of what I mean follows this list).
Spark Job Server (https://github.com/ooyala/spark-jobserver) - also sketched below.
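For option 2, what I mean is a long-lived service that owns a single SparkContext and runs work in response to HTTP calls. A rough sketch using the JDK's built-in HTTP server (the endpoint and the job it runs are just placeholders):

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}

// A long-running "driver webapp": the SparkContext lives as long as the service does,
// and each HTTP request triggers a job on it.
object DriverWebApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-webapp"))

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/run-etl", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        val count = sc.parallelize(1 to 1000).count()  // stand-in for the real ETL
        val body = s"""{"count": $count}"""
        exchange.sendResponseHeaders(200, body.getBytes.length.toLong)
        exchange.getResponseBody.write(body.getBytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```

For option 3, jobs implement the job server's SparkJob trait instead of providing their own main(), and are triggered over its REST API; the server keeps the SparkContext alive between runs. Something along these lines, going by the ooyala/spark-jobserver README (so treat the exact API as approximate):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// The job server owns the SparkContext; this object only supplies the job logic.
object EventCountsJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(1 to 1000).count()  // stand-in for the real aggregation
}
```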
I'd like to understand the general consensus on a simple but robust deployment strategy - I haven't been able to determine one by trawling the web so far.
Thanks very much!
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Big Data service that lets you run Apache Spark applications at any scale with no administration.
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos - e.g. you could just use Chronos to trigger the spark-submit.
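For illustration, registering a recurring spark-submit with Chronos is just a POST of a job definition to its REST API. The snippet below is a rough sketch - the Chronos host/port, the /scheduler/iso8601 endpoint, the schedule and the paths are assumptions based on the Chronos docs, so adjust them to your install:

```scala
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

// Registers a job with Chronos that shells out to spark-submit once a day.
object RegisterEtlWithChronos {
  def main(args: Array[String]): Unit = {
    val job =
      """{
        |  "name": "events-etl",
        |  "owner": "you@example.com",
        |  "schedule": "R/2015-01-01T03:00:00Z/PT24H",
        |  "epsilon": "PT30M",
        |  "command": "/opt/spark/bin/spark-submit --class EventsEtlJob --master spark://master-host:7077 /opt/jobs/events-etl-assembly.jar"
        |}""".stripMargin

    val conn = new URL("http://chronos-host:4400/scheduler/iso8601")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    val out = new OutputStreamWriter(conn.getOutputStream)
    out.write(job)
    out.close()

    println("Chronos responded with HTTP " + conn.getResponseCode)
  }
}
```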
I hope I understood your problem correctly and this helps you a bit!