I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from the disk change/ process certain columns and then write it back to the disk.
In my experiments, if I use SparkContext's parallelize method only then it does not seem to have any impact on the performance. However simply using Scala's parallel collections (through par) reduces the time almost to half.
I am running my experiments in localhost mode with the arguments local[2] for the spark context.
My question is when should I use scala's parallel collections and when to use spark context's parallelize?
SparkContext will have additional processing in order to support generality of multiple nodes, this will be constant on the data size so may be negligible for huge data sets. On 1 node this overhead will make it slower than Scala's parallel collections.
Use Spark when
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With