I have 20GB of data that requires processing, and all of it fits on my local machine. I'm planning to use either Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.
Since the data fits on a single machine, should I use Scala parallel collections?
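For reference, here is roughly what I have in mind for the parallel-collections route (naive dense matrix multiplication, parallelised over the rows of the result). A minimal sketch, assuming Scala 2.13+ with the scala-parallel-collections module on the classpath; the names are illustrative only:

```scala
import scala.collection.parallel.CollectionConverters._ // Scala 2.13+: parallel collections are a separate module

// Sketch: naive dense matrix multiplication parallelised over the rows of the
// result with Scala parallel collections. Dimensions are assumed compatible
// (a is m x n, b is n x p); everything runs inside one JVM.
object ParMatMul {
  def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val n = b.length
    val p = b(0).length
    // Each result row depends only on one row of `a` and all of `b`,
    // so the rows can be computed independently on different cores.
    a.par.map { row =>
      Array.tabulate(p) { j =>
        var sum = 0.0
        var k = 0
        while (k < n) { sum += row(k) * b(k)(j); k += 1 }
        sum
      }
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val a = Array.fill(500, 400)(scala.util.Random.nextDouble())
    val b = Array.fill(400, 300)(scala.util.Random.nextDouble())
    val c = multiply(a, b)
    println(s"result: ${c.length} x ${c(0).length}")
  }
}
```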
Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?
Spark will have the overhead of setting up and scheduling parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?
From the Spark standalone-mode documentation: "In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing."
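For a single machine you do not even need the standalone daemons: Spark's local mode runs the driver and executors inside one JVM. A minimal sketch (the application name and input path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: run Spark entirely inside one JVM with local mode.
// "local[*]" means one worker thread per available core; no master or
// worker daemons need to be started. The input path is a placeholder.
object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.read.textFile("data/input") // e.g. a directory of local text files
    println(s"lines: ${lines.count()}")

    spark.stop()
  }
}
```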
And from the Spark FAQ: "Does my data need to fit in memory to use Spark? No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data."
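The spilling described there happens automatically for shuffles; for cached data you opt into it with a storage level. A small sketch, assuming the Scala API, with synthetic data standing in for a large local dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: cache an RDD with a storage level that spills partitions to local
// disk when they do not fit in memory, instead of failing or recomputing.
object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Synthetic data standing in for a dataset that may be larger than the heap.
    val big = sc.parallelize(1 to 100000000, numSlices = 200).map(i => i.toLong * i)

    // MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    println(big.count())

    spark.stop()
  }
}
```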
It's hard to give non-obvious rules like "if your data takes up no more than 80% of memory and ..., then use local mode." Having said that, there are a couple of points which, in general, may make Spark worth using even if your data fits in one machine's memory.
In your particular case, you ask:
"...since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?"
Of course it won't: Spark is not voodoo magic that can somehow get your data closer to the CPU. But it can help you scale out across machines, and therefore across more CPUs, which is the first of those points.
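Concretely, that scaling is mostly a configuration change rather than a code change; a sketch, with placeholder master URLs and input path:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the same job runs on one machine or on a cluster; only the
// master URL changes. The URLs and input path below are placeholders.
object ScalableJob {
  def main(args: Array[String]): Unit = {
    // e.g. "local[*]" on a laptop, "spark://master-host:7077" for a standalone
    // cluster, or "yarn" when submitting to YARN.
    val master = args.headOption.getOrElse("local[*]")

    val spark = SparkSession.builder()
      .appName("scalable-job")
      .master(master)
      .getOrCreate()

    val data = spark.read.textFile("data/input")
    println(s"records: ${data.count()}")

    spark.stop()
  }
}
```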
"Spark will have the overhead of setting up and scheduling parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?"
I may sound like Captain Obvious, but yes: Spark does pay for scheduling and serialising tasks, which plain Scala parallel collections do not, so on a single machine that overhead is a cost you only accept for the benefits above.
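If you want to see that overhead for yourself, a rough sketch like the following (not a rigorous benchmark; the sizes and names are arbitrary, and it assumes Scala 2.13+ with the scala-parallel-collections module) compares the same computation with parallel collections and with Spark in local mode:

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.parallel.CollectionConverters._

// Rough, illustrative comparison: the same sum computed with parallel
// collections (threads in one JVM) and with Spark in local mode, where each
// action additionally pays job/stage/task scheduling and serialisation costs.
object OverheadSketch {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10000000).toArray

    // Parallel collections: work is split across local threads directly.
    time("parallel collections")(data.par.map(x => x.toLong * x).sum)

    // Spark local[*]: same computation expressed as an RDD job.
    val spark = SparkSession.builder().appName("overhead-sketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(data.toSeq, numSlices = 8)
    time("spark local[*]")(rdd.map(x => x.toLong * x).reduce(_ + _))
    spark.stop()
  }
}
```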
There is also a [cheeky] final point in the list of reasons to use Spark: the hype. Spark is a very sexy technology which is easy to "sell" both to your devs (it's the cutting edge of big data) and to the company (your boss, if you're building your own product; your customer, if you're building a product for somebody else).