
Understanding parallelism in Spark and Scala

I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many CSV files from disk, change/process certain columns, and then write them back to disk.

In my experiments, using only SparkContext's parallelize method does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) cuts the time almost in half.

I am running my experiments in local mode, with the argument local[2] for the Spark context.

My question is: when should I use Scala's parallel collections, and when should I use SparkContext's parallelize?
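For reference, here is a minimal sketch of the two approaches I am comparing. The file paths and the processFile body are placeholders for my real CSV transformation, and on Scala 2.13+ the .par call additionally requires the separate scala-parallel-collections module:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismExperiment {

      // Placeholder for the real work: read one CSV, change some columns, write it back.
      def processFile(path: String): Unit = {
        val lines = scala.io.Source.fromFile(path).getLines().toList
        val transformed = lines.map(_.toUpperCase) // stand-in for the real column changes
        val writer = new java.io.PrintWriter(path + ".out")
        try transformed.foreach(line => writer.println(line)) finally writer.close()
      }

      def main(args: Array[String]): Unit = {
        val files = Seq("data/a.csv", "data/b.csv", "data/c.csv") // placeholder paths

        // Approach 1: Scala parallel collections, a thread pool on this one machine.
        files.par.foreach(processFile)

        // Approach 2: SparkContext.parallelize, tasks scheduled by Spark (local[2] here).
        val sc = new SparkContext(new SparkConf().setAppName("exp").setMaster("local[2]"))
        sc.parallelize(files).foreach(p => processFile(p))
        sc.stop()
      }
    }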

asked Nov 04 '13 by MARK

1 Answer

SparkContext incurs additional processing in order to support the generality of multiple nodes. This overhead is roughly constant with respect to data size, so it may be negligible for huge data sets; on a single node, however, it will make Spark slower than Scala's parallel collections.

Use Spark when

  1. You have more than 1 node
  2. You want your job to be ready to scale to multiple nodes
  3. The data is huge, so the Spark overhead on one node is negligible and you might as well choose the richer framework (see the sketch below)
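As a hedged sketch of point 3: if the data really is huge, the same job can be written against Spark's own distributed input and output instead of parallelizing a list of local file names, so that reading, transforming and writing all scale with the cluster. The paths and the column change below are placeholders, and the master would be supplied by spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    object DistributedCsvJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("csv-job"))

        sc.textFile("input/*.csv")                // distributed read, one task per split
          .map(_.split(",", -1))                  // naive CSV parsing, no quoting handled
          .map(cols =>                            // placeholder column change
            if (cols.length > 1) cols.updated(1, cols(1).trim) else cols)
          .map(_.mkString(","))
          .saveAsTextFile("output")               // distributed write, one part file per partition

        sc.stop()
      }
    }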
answered Sep 20 '22 by samthebest