I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
(1) General ETL use case - more specifically, a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
(2) Streaming use case - real-time analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long-running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no YARN or Mesos.)
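To make this concrete, here's roughly the shape of the two applications I have in mind. The keyspace/table names and the aggregation are made up, I'm using the DataStax spark-cassandra-connector, and I'm assuming the streaming events arrive via Kafka - none of those details should matter much for the deployment question:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// (1) Periodic batch ETL: read the raw events column family, aggregate, write back.
// Launched periodically with something like:
//   spark-submit --class EventsEtlJob --master spark://master-host:7077 events-etl-assembly.jar
object EventsEtlJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("events-etl")
      .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
    val sc = new SparkContext(conf)

    sc.cassandraTable("events_ks", "events")                    // hypothetical keyspace/table
      .map(row => (row.getString("event_type"), 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("events_ks", "events_by_type", SomeColumns("event_type", "count"))

    sc.stop()
  }
}

// (2) Long-running streaming job, started once at boot and left running.
object EventsStreamingJob {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("events-streaming"), Seconds(10))

    val events = KafkaUtils
      .createStream(ssc, "zk-host:2181", "events-consumers", Map("events" -> 1))
      .map(_._2)            // drop the Kafka key, keep the event payload

    events.count().print()  // stand-in for the real per-batch analysis

    ssc.start()
    ssc.awaitTermination()
  }
}
```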
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a JAR and running the various tasks with spark-submit - this seems to be the approach recommended in the Spark docs, though I have some thoughts about this strategy.
Creating a separate webapp as the driver program (a rough sketch of what I mean follows this list).
Spark Job Server (https://github.com/ooyala/spark-jobserver) - also sketched below.
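For option 2, what I mean is a long-lived service that owns a single SparkContext and runs work in response to HTTP calls. A rough sketch using the JDK's built-in HTTP server (the endpoint and the job it runs are just placeholders):

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}

// A long-running "driver webapp": the SparkContext lives as long as the service does,
// and each HTTP request triggers a job on it.
object DriverWebApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-webapp"))

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/run-etl", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        val count = sc.parallelize(1 to 1000).count()  // stand-in for the real ETL
        val body = s"""{"count": $count}"""
        exchange.sendResponseHeaders(200, body.getBytes.length.toLong)
        exchange.getResponseBody.write(body.getBytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```

For option 3, jobs implement the job server's SparkJob trait instead of providing their own main(), and are triggered over its REST API; the server keeps the SparkContext alive between runs. Something along these lines, going by the ooyala/spark-jobserver README (so treat the exact API as approximate):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// The job server owns the SparkContext; this object only supplies the job logic.
object EventCountsJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(1 to 1000).count()  // stand-in for the real aggregation
}
```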
I'd like to understand the general consensus on a simple but robust deployment strategy - I haven't been able to determine one by trawling the web so far.
Thanks very much!
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Big Data service that lets you run Apache Spark applications at any scale with no administration.
Even though you are not using Mesos for Spark, you could have a look at:
- Chronos, offering a distributed and fault-tolerant cron
- Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos - e.g. you could just use Chronos to trigger the spark-submit.
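For illustration, registering a recurring spark-submit with Chronos is just a POST of a job definition to its REST API. The snippet below is a rough sketch - the Chronos host/port, the /scheduler/iso8601 endpoint, the schedule and the paths are assumptions based on the Chronos docs, so adjust them to your install:

```scala
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

// Registers a job with Chronos that shells out to spark-submit once a day.
object RegisterEtlWithChronos {
  def main(args: Array[String]): Unit = {
    val job =
      """{
        |  "name": "events-etl",
        |  "owner": "you@example.com",
        |  "schedule": "R/2015-01-01T03:00:00Z/PT24H",
        |  "epsilon": "PT30M",
        |  "command": "/opt/spark/bin/spark-submit --class EventsEtlJob --master spark://master-host:7077 /opt/jobs/events-etl-assembly.jar"
        |}""".stripMargin

    val conn = new URL("http://chronos-host:4400/scheduler/iso8601")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    val out = new OutputStreamWriter(conn.getOutputStream)
    out.write(job)
    out.close()

    println("Chronos responded with HTTP " + conn.getResponseCode)
  }
}
```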
I hope I understood your problem correctly and this helps you a bit!