
Is Apache Spark good for lots of small, fast computations and a few big, non-interactive ones?

I'm evaluating Apache Spark to see whether it's a good platform for the following requirements:

  • Cloud computing environment.
  • Commodity hardware.
  • Distributed DB (e.g. HBase) with possibly a few petabytes of data.
  • Lots of simultaneous small computations that need to complete fast (within seconds). Small means 1-100 MBs of data.
  • A few large computations that don't need to complete fast (hours is fine). Large means 10-1000 GBs of data.
  • Very rarely, very large computations that don't need to complete fast (days is fine). Very large means 10-100 TBs of data.
  • All computations are mutually independent.
  • Real-time data stream incoming for some of the computations.
  • Machine learning involved.

Having read a bit about Spark, I see the following advantages:

  • Runs well on commodity hardware and with HBase/Cassandra.
  • MLlib for machine learning.
  • Spark Streaming for real-time data.
  • While MapReduce doesn't seem strictly necessary, maybe it could speed things up, and would let us adapt if the requirements became tighter in the future.

These are the main questions I still have:

  • Can it do small computations very fast?
  • Will it load-balance a large number of simultaneous small computations?

I also wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs. If so, I'd also welcome a suggestion for an alternative. Many thanks!

asked Jul 12 '14 by Jan Żankowski

2 Answers

Small computations fast

We do use Spark in an interactive setting, as the backend of a web interface. Sub-second latencies are possible, but not easy. Some tips:

  • Create the SparkContext at startup. It takes a few seconds to connect and get the executors started on the workers.
  • You mention many simultaneous computations. Instead of each user having their own SparkContext and own set of executors, have just one that everyone can share. In our case multiple users can use the web interface concurrently, but there's only one web server.
  • Operate on memory-cached RDDs. Serialization is probably too slow, so use the default caching, not Tachyon. If you cannot avoid serialization, use Kryo. It is way faster than stock Java serialization.
  • Use RDD.sample liberally. An unbiased sample is often good enough for interactive exploration. (A sketch combining these points follows this list.)
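
For illustration, a minimal Scala sketch of this setup, assuming a long-lived backend process; the app name and HDFS path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedSparkBackend {
      // One SparkContext, created once at web-server startup and shared by every request.
      val conf = new SparkConf()
        .setAppName("interactive-backend") // placeholder name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo beats stock Java serialization
      val sc = new SparkContext(conf)

      // Keep the working set deserialized in memory (the default cache level),
      // so interactive queries avoid serialization entirely.
      val events = sc.textFile("hdfs:///data/events").cache() // placeholder path

      // For interactive exploration, an unbiased sample is often good enough.
      def sampledCount(fraction: Double): Long =
        events.sample(false, fraction).count()
    }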

Load balancing

Load balancing of operations is a good question. We will have to tackle this as well, but have not done it yet. In the default setup everything is processed in a first-in-first-out manner. Each operation gets the full resources of the cluster and the next operation has to wait. This is fine if each operation is fast, but what if one isn't?

The alternative fair scheduler likely solves this issue, but I have not tried it yet.
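
For illustration, a rough sketch of how the fair scheduler could be enabled for a single application (again, I have not tried this myself); the app name and allocation file path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch this application's internal scheduler from the default FIFO to FAIR,
    // so one slow job no longer holds up everything queued behind it.
    val conf = new SparkConf()
      .setAppName("shared-cluster-app") // placeholder name
      .set("spark.scheduler.mode", "FAIR")
      // Optional: pools with weights and minimum shares are defined in an XML allocation file.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path
    val sc = new SparkContext(conf)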

Spark can also off-load scheduling to YARN or Mesos, but I have no experience with this. I doubt they are compatible with your latency requirements.

answered by Daniel Darabos


I think the short answer is "yes". Spark advertises itself as "near real-time". The one or two papers I've read describe latencies of one to several seconds. For best performance, look at combining it with Tachyon, an in-memory distributed file system.

As for load-balancing, later releases of Spark can use a round-robin scheduler so that both large and small jobs can co-exist in the same cluster:

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
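
As an illustration of that quoted behaviour, here is a rough Scala sketch: with fair sharing enabled, jobs submitted from separate threads of one shared SparkContext get interleaved. The pool names, paths, and RDD operations are hypothetical:

    import org.apache.spark.SparkContext

    // With spark.scheduler.mode set to FAIR, jobs submitted from separate threads of
    // the same SparkContext are interleaved instead of queued behind one another.
    def runInPool(sc: SparkContext, pool: String)(job: => Unit): Thread = {
      val t = new Thread(new Runnable {
        def run(): Unit = {
          // The pool is a thread-local property, so each request thread can pick its own.
          sc.setLocalProperty("spark.scheduler.pool", pool)
          job
        }
      })
      t.start()
      t
    }

    // Illustrative usage (pool names and paths are hypothetical):
    // runInPool(sc, "batch")       { sc.textFile("hdfs:///big/dataset").count() }
    // runInPool(sc, "interactive") { sc.parallelize(1 to 1000).map(_ * 2).count() }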

However, I am not clear on what you mean by "...wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs."

I might be wrong, but I don't think you can use Spark without RDDs. Spark is distributed computing with RDDs. What kind of jobs are you trying to run, if not MapReduce-style jobs? If your use cases are not a good fit for the examples provided with the Spark documentation or tutorials, then what are they a good fit for? Hadoop/Spark shine when there are tons of data and very little iterative computation on the data. For example, solving systems of equations is not a traditional use case for these technologies.

Is there a need to distribute jobs that only involve 1-100 MB? Such small amounts of data are often processed most quickly on a single powerful node. If there is a reason to distribute the computation, look at running MPI under Mesos. A lot of jobs that fall under the name "scientific computing" continue to use MPI as the distributed computing model.

If the jobs are about crunching numbers (e.g. matrix multiplication), then small-medium jobs can be handled quickly with GPU computing on a single node. I've used Nvidia's CUDA programming environment. It rocks for computationally intensive tasks like computer vision.

What is the nature of the jobs that will run in your environment?

answered by ahoffer