
Understanding parallelism in Spark and Scala

I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many CSV files from disk, change/process certain columns, and then write them back to disk.

In my experiments, using only SparkContext's parallelize method does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) cuts the time almost in half.

I am running my experiments in local mode, with the argument local[2] for the Spark context.

My question is: when should I use Scala's parallel collections, and when should I use SparkContext's parallelize?
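For reference, here is a minimal sketch of the two approaches I am comparing. The file paths and the processFile body are placeholders for my real CSV transformation, and on Scala 2.13+ the .par call additionally requires the separate scala-parallel-collections module:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismExperiment {

      // Placeholder for the real work: read one CSV, change some columns, write it back.
      def processFile(path: String): Unit = {
        val lines = scala.io.Source.fromFile(path).getLines().toList
        val transformed = lines.map(_.toUpperCase) // stand-in for the real column changes
        val writer = new java.io.PrintWriter(path + ".out")
        try transformed.foreach(line => writer.println(line)) finally writer.close()
      }

      def main(args: Array[String]): Unit = {
        val files = Seq("data/a.csv", "data/b.csv", "data/c.csv") // placeholder paths

        // Approach 1: Scala parallel collections, a thread pool on this one machine.
        files.par.foreach(processFile)

        // Approach 2: SparkContext.parallelize, tasks scheduled by Spark (local[2] here).
        val sc = new SparkContext(new SparkConf().setAppName("exp").setMaster("local[2]"))
        sc.parallelize(files).foreach(p => processFile(p))
        sc.stop()
      }
    }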

asked Nov 04 '13 by MARK

1 Answer

SparkContext incurs additional processing in order to support the generality of multiple nodes. This overhead is roughly constant with respect to data size, so it may be negligible for huge data sets; on a single node, however, it will make Spark slower than Scala's parallel collections.

Use Spark when

  1. You have more than 1 node
  2. You want your job to be ready to scale to multiple nodes
  3. The data is huge, so the Spark overhead on one node is negligible and you might as well choose the richer framework (see the sketch below)
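As a hedged sketch of point 3: if the data really is huge, the same job can be written against Spark's own distributed input and output instead of parallelizing a list of local file names, so that reading, transforming and writing all scale with the cluster. The paths and the column change below are placeholders, and the master would be supplied by spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    object DistributedCsvJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("csv-job"))

        sc.textFile("input/*.csv")                // distributed read, one task per split
          .map(_.split(",", -1))                  // naive CSV parsing, no quoting handled
          .map(cols =>                            // placeholder column change
            if (cols.length > 1) cols.updated(1, cols(1).trim) else cols)
          .map(_.mkString(","))
          .saveAsTextFile("output")               // distributed write, one part file per partition

        sc.stop()
      }
    }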
answered Sep 20 '22 by samthebest