
If data fits on a single machine does it make sense to use Spark?

I have 20GB of data that requires processing, and all of it fits on my local machine. I'm planning on using Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.

Since the data fits on a single machine, should I use Scala parallel collections?

Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?
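
For reference, here is roughly what I have in mind for the parallel-collections route (just a sketch; the sizes and data are made up):

```scala
// Minimal sketch of row-wise matrix-vector multiplication with Scala
// parallel collections, running in a single JVM. Sizes and data are
// made up for illustration. On Scala 2.13+ this needs the
// scala-parallel-collections module and
// `import scala.collection.parallel.CollectionConverters._`;
// on 2.12 and earlier, .par is available out of the box.
object ParMatMul {
  def main(args: Array[String]): Unit = {
    val n = 4096
    val matrix: Array[Array[Double]] = Array.fill(n, n)(scala.util.Random.nextDouble())
    val vector: Array[Double]        = Array.fill(n)(scala.util.Random.nextDouble())

    // .par splits the row-wise work across all local cores.
    val result: Array[Double] =
      matrix.par
        .map(row => row.zip(vector).map { case (a, b) => a * b }.sum)
        .toArray

    println(s"first component = ${result.head}")
  }
}
```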

asked May 28 '14 by blue-sky




1 Answer

It's hard to provide non-obvious instructions like "if your data doesn't go above 80% of memory and ..., then use local mode". Having said that, there are a couple of points which, in general, may make you use Spark even if your data fits in a single machine's memory:

  1. really intensive CPU processing -- off the top of my head, complicated parsing of texts
  2. stability -- say you have many processing stages and you don't want to lose results once your single machine goes down. This is especially important if you have recurrent calculations rather than one-off queries (that way, the time you spend on bringing Spark to the table may pay off; a minimal sketch of what that looks like follows this list)
  3. streaming -- you get your data from somewhere in a streaming manner, and even though a snapshot of it fits on a single machine, you have to orchestrate it somehow
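
For a sense of what bringing Spark to the table looks like, here is a minimal sketch using the RDD API. The input path, CSV layout, and vector size are assumptions for illustration, not anything from the question:

```scala
// Minimal Spark sketch (classic RDD API). The input path and one-row-per-line
// CSV layout are assumptions made for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SparkLocalSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs on all local cores; pointing the master at a cluster
    // later (e.g. "spark://host:7077") scales the same job across machines.
    val conf = new SparkConf().setAppName("matmul-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val vector = Array.fill(4096)(scala.util.Random.nextDouble())
    val bVec   = sc.broadcast(vector)        // small side data shipped to every task

    // Hypothetical input: one matrix row per line, comma-separated values.
    val rows = sc.textFile("data/matrix.csv")
      .map(_.split(",").map(_.toDouble))
      .persist(StorageLevel.MEMORY_AND_DISK) // spills to disk instead of failing

    val result = rows
      .map(row => row.zip(bVec.value).map { case (a, b) => a * b }.sum)
      .collect()

    println(s"computed ${result.length} components")
    sc.stop()
  }
}
```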

In your particular case

so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement

Of course it won't. Spark is not voodoo magic that somehow gets your data closer to the CPU, but it can help you scale across machines, and thus across CPUs (point #1).

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?

I may sound like Captain Obvious, but:

  1. Take #2 and #3 into consideration: do you need them? If yes, go with Spark or something else that provides them.
  2. If not, implement your processing in a simple way (parallel collections).
  3. Profile it and take a look. Is your processing CPU bound? Can you speed it up without a lot of tweaks? If not, go with Spark (a rough timing sketch follows this list).
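
For the profiling step, a crude sketch (the helper and the workload below are made up; your real processing would go inside the timed blocks):

```scala
// Hypothetical timing helper for the "profile and take a look" step.
// The workload is a stand-in, not the asker's actual algorithm.
// (.par: see the Scala 2.13+ note in the earlier parallel-collections sketch.)
object ProfileSketch {
  def timed[T](label: String)(body: => T): T = {
    val start  = System.nanoTime()
    val result = body
    println(f"$label took ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val sample = Array.fill(2000000)(scala.util.Random.nextDouble())
    timed("sequential") { sample.map(x => math.sqrt(x) * x).sum }
    timed("parallel")   { sample.par.map(x => math.sqrt(x) * x).sum }
  }
}
```

If the parallel run already keeps every core busy and is still too slow, adding machines (i.e. Spark) is the next lever; otherwise there is probably local tuning left to do.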

There is also a [cheeky] point 4) in the list of "Why should I use Spark?": the hype. Spark is a very sexy technology which is easy to "sell" to both your devs (it's the cutting edge of big data) and the company (your boss, in case you're building your own product; your customer, in case you're building a product for somebody else).

answered Sep 17 '22 by om-nom-nom