I'm evaluating Apache Spark to see if it's a good platform for the following requirements:
Having read a bit about Spark, I see the following advantages:
These are the main questions I still have:
I also wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs. If so, I'd also welcome a suggestion for an alternative. Many thanks!
We do use Spark in an interactive setting, as the backend of a web interface. Sub-second latencies are possible, but not easy. Some tips:
- Create the `SparkContext` on start-up. It takes a few seconds to get connected and get the executors started on the workers.
- Instead of each user having their own `SparkContext` and own set of executors, have just one that everyone can share. In our case multiple users can use the web interface concurrently, but there's only one web server.
- Use `RDD.sample` liberally. An unbiased sample is often good enough for interactive exploration. (A minimal sketch of these ideas follows at the end of this answer.)

Load balancing of operations is a good question. We will have to tackle this as well, but have not done it yet. In the default setup everything is processed in a first-in-first-out manner: each operation gets the full resources of the cluster, and the next operation has to wait. This is fine if each operation is fast, but what if one isn't?
The alternative fair scheduler likely solves this issue, but I have not tried it yet.
Spark can also off-load scheduling to YARN or Mesos, but I have no experience with this. I doubt they are compatible with your latency requirements.
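Here is a minimal sketch of the first and third tips, assuming a long-running web server in front of a standalone Spark cluster. The object names, app name, master URL, data path, and sample fraction are all illustrative placeholders, not anything from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// One long-lived SparkContext shared by all web requests, plus RDD.sample
// for fast interactive previews. Names and values below are placeholders.
object SharedSpark {
  // Built once; reference it during web-server start-up so the connection
  // and executor launch cost is paid before the first user request arrives.
  val sc: SparkContext = new SparkContext(
    new SparkConf()
      .setAppName("interactive-backend")
      .setMaster("spark://master:7077")) // placeholder master URL
}

object ExploreHandler {
  // Serve an interactive request from an unbiased sample of the data
  // rather than scanning the full dataset.
  def preview(path: String): Array[String] = {
    val data: RDD[String] = SharedSpark.sc.textFile(path)
    data
      .sample(withReplacement = false, fraction = 0.01) // ~1% is often enough
      .take(100)                                        // cap what goes back to the browser
  }
}
```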
I think the short answer is "yes". Spark advertises itself as "near real-time". The one or two papers I've read describe latencies of one to several seconds. For best performance, look at combining it with Tachyon, an in-memory distributed file system.
As for load-balancing, later releases of Spark can use a round-robin scheduler so that both large and small jobs can co-exist in the same cluster:
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
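As a rough illustration, fair sharing can be switched on from application code via standard Spark properties (`spark.scheduler.mode`, `spark.scheduler.pool`, `spark.scheduler.allocation.file`); the pool name, allocation-file path, and master URL below are placeholders, not something prescribed by the documentation quoted above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-sharing-demo")
      .setMaster("local[*]")                    // placeholder master
      .set("spark.scheduler.mode", "FAIR")      // enable fair sharing (Spark 0.8+)
      // Optional: per-pool weights and minimum shares defined in an XML file.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go into the "interactive" pool, so short
    // interactive queries are not stuck behind long-running batch jobs.
    sc.setLocalProperty("spark.scheduler.pool", "interactive")

    // ... submit jobs as usual ...
    sc.stop()
  }
}
```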
However, I am not clear on what you mean by "...wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs."
I might be wrong, but I don't think you can use Spark without RDDs. Spark is distributed computing with RDDs. What kind of jobs are you trying to run, if not MapReduce-style jobs? If your use cases are not a good fit for the example use cases provided with the Spark documentation or tutorials, what are they a good fit for? Hadoop and Spark shine when there are tons of data and very little iterative computation on the data. For example, solving systems of equations is not a traditional use case for these technologies.
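For reference, a MapReduce-style job expressed with RDDs can be as small as a word count; the input path and local master in this sketch are placeholders for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count").setMaster("local[*]"))

    val counts = sc.textFile("hdfs:///data/documents") // placeholder input path
      .flatMap(_.split("\\s+"))                        // "map" phase: one record per word
      .map(word => (word, 1))
      .reduceByKey(_ + _)                              // "reduce" phase: sum counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```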
Is there a need to distribute jobs that only involve 1-100 MB? Such small amounts of data are often processed most quickly on a single powerful node. If there is a reason to distribute the computation, look at running MPI under Mesos. A lot of jobs that fall under the name "scientific computing" continue to use MPI as the distributed computing model.
If the jobs are about crunching numbers (e.g. matrix multiplication), then small-to-medium jobs can be handled quickly with GPU computing on a single node. I've used Nvidia's CUDA programming environment; it rocks for computationally intensive tasks like computer vision.
What is the nature of the jobs that will run in your environment?