I have 20GB of data that requires processing, and all of it fits on my local machine. I'm planning to use either Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.
Since the data fits on a single machine, should I use Scala parallel collections?
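For reference, here is roughly what I have in mind for the parallel-collections route (naive dense matrix multiplication, parallelised over the rows of the result). A minimal sketch, assuming Scala 2.13+ with the scala-parallel-collections module on the classpath; the names are illustrative only:

```scala
import scala.collection.parallel.CollectionConverters._ // Scala 2.13+: parallel collections are a separate module

// Sketch: naive dense matrix multiplication parallelised over the rows of the
// result with Scala parallel collections. Dimensions are assumed compatible
// (a is m x n, b is n x p); everything runs inside one JVM.
object ParMatMul {
  def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val n = b.length
    val p = b(0).length
    // Each result row depends only on one row of `a` and all of `b`,
    // so the rows can be computed independently on different cores.
    a.par.map { row =>
      Array.tabulate(p) { j =>
        var sum = 0.0
        var k = 0
        while (k < n) { sum += row(k) * b(k)(j); k += 1 }
        sum
      }
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val a = Array.fill(500, 400)(scala.util.Random.nextDouble())
    val b = Array.fill(400, 300)(scala.util.Random.nextDouble())
    val c = multiply(a, b)
    println(s"result: ${c.length} x ${c(0).length}")
  }
}
```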
Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?
Spark will have the overhead of setting up and scheduling parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?
From the Spark standalone-mode documentation: "In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing."
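For a single machine you do not even need the standalone daemons: Spark's local mode runs the driver and executors inside one JVM. A minimal sketch (the application name and input path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: run Spark entirely inside one JVM with local mode.
// "local[*]" means one worker thread per available core; no master or
// worker daemons need to be started. The input path is a placeholder.
object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.read.textFile("data/input") // e.g. a directory of local text files
    println(s"lines: ${lines.count()}")

    spark.stop()
  }
}
```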
And from the Spark FAQ: "Does my data need to fit in memory to use Spark? No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data."
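The spilling described there happens automatically for shuffles; for cached data you opt into it with a storage level. A small sketch, assuming the Scala API, with synthetic data standing in for a large local dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: cache an RDD with a storage level that spills partitions to local
// disk when they do not fit in memory, instead of failing or recomputing.
object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Synthetic data standing in for a dataset that may be larger than the heap.
    val big = sc.parallelize(1 to 100000000, numSlices = 200).map(i => i.toLong * i)

    // MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    println(big.count())

    spark.stop()
  }
}
```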
It's hard to give non-obvious rules like "if your data takes up no more than 80% of memory and ..., then use local mode." Having said that, there are a couple of points which, in general, may make Spark worth using even if your data fits in one machine's memory.
In your particular case, you ask:
"...since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?"
Of course it won't: Spark is not voodoo magic that can somehow get your data closer to the CPU. But it can help you scale out across machines, and therefore across more CPUs, which is the first of those points.
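Concretely, that scaling is mostly a configuration change rather than a code change; a sketch, with placeholder master URLs and input path:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the same job runs on one machine or on a cluster; only the
// master URL changes. The URLs and input path below are placeholders.
object ScalableJob {
  def main(args: Array[String]): Unit = {
    // e.g. "local[*]" on a laptop, "spark://master-host:7077" for a standalone
    // cluster, or "yarn" when submitting to YARN.
    val master = args.headOption.getOrElse("local[*]")

    val spark = SparkSession.builder()
      .appName("scalable-job")
      .master(master)
      .getOrCreate()

    val data = spark.read.textFile("data/input")
    println(s"records: ${data.count()}")

    spark.stop()
  }
}
```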
"Spark will have the overhead of setting up and scheduling parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?"
I may sound like Captain Obvious, but yes: Spark does pay for scheduling and serialising tasks, which plain Scala parallel collections do not, so on a single machine that overhead is a cost you only accept for the benefits above.
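If you want to see that overhead for yourself, a rough sketch like the following (not a rigorous benchmark; the sizes and names are arbitrary, and it assumes Scala 2.13+ with the scala-parallel-collections module) compares the same computation with parallel collections and with Spark in local mode:

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.parallel.CollectionConverters._

// Rough, illustrative comparison: the same sum computed with parallel
// collections (threads in one JVM) and with Spark in local mode, where each
// action additionally pays job/stage/task scheduling and serialisation costs.
object OverheadSketch {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10000000).toArray

    // Parallel collections: work is split across local threads directly.
    time("parallel collections")(data.par.map(x => x.toLong * x).sum)

    // Spark local[*]: same computation expressed as an RDD job.
    val spark = SparkSession.builder().appName("overhead-sketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(data.toSeq, numSlices = 8)
    time("spark local[*]")(rdd.map(x => x.toLong * x).reduce(_ + _))
    spark.stop()
  }
}
```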
There is also a [cheeky] final point in the list of reasons to use Spark: the hype. Spark is a very sexy technology which is easy to "sell" both to your devs (it's the cutting edge of big data) and to the company (your boss, if you're building your own product; your customer, if you're building a product for somebody else).