Spark: increase number of partitions without causing a shuffle?

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).

I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0 - so what would happen is it would split a partition so that the resulting partitions' locations were all on the same node (so small net IO).
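A minimal sketch of the asymmetry described above, assuming a SparkContext sc as in the spark-shell (the input path and partition counts are hypothetical):

    val rdd = sc.textFile("hdfs:///data/input")

    // Decreasing: coalesce merges partitions locally - no shuffle, no extra stage.
    val fewer = rdd.coalesce(10)

    // Increasing: repartition is implemented as coalesce(n, shuffle = true),
    // so it always triggers a full shuffle.
    val more = rdd.repartition(100)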

This kind of functionality is automatic in Hadoop, where one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do that surely there must be a straightforward way of doing it?

Things tried:

.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND not really want I want, I want partition number to change across all types of job, not just shuffles.

asked Nov 20 '14 by samthebest


People also ask

How do I increase the number of partitions in Spark?

If you want to increase the partitions of your DataFrame, all you need to run is the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
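A short illustration, where spark, the input path, and the partition counts are placeholders:

    val df = spark.read.parquet("/data/events")   // hypothetical input
    df.rdd.getNumPartitions                       // e.g. 8

    val repartitioned = df.repartition(64)        // full shuffle, hash partitioned
    repartitioned.rdd.getNumPartitions            // 64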

How do I stop my Spark from shuffling?

One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
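In the DataFrame API this is usually expressed with the broadcast hint (largeDf and smallDf are placeholder DataFrames):

    import org.apache.spark.sql.functions.broadcast

    // The small side is shipped to every executor; the large side is never shuffled.
    val joined = largeDf.join(broadcast(smallDf), Seq("id"))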

What will avoid full shuffle in Spark if partitions are set to be decreased?

Coalesce doesn't involve a full shuffle. If the number of partitions is reduced from 5 to 2, coalesce leaves the data on 2 of the executors in place and moves the data from the remaining 3 executors onto those 2, thereby avoiding a full shuffle.
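For example, assuming a SparkContext sc:

    val rdd5 = sc.parallelize(1 to 100, 5) // 5 partitions
    val rdd2 = rdd5.coalesce(2)            // merged locally, no full shuffle
    rdd2.getNumPartitions                  // 2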

Does shuffling change the number of partitions?

Though reduceByKey() triggers a data shuffle, it doesn't change the partition count, because RDDs inherit the partition count from their parent RDD. You may still see different partition counts depending on your setup and how Spark creates partitions.
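A quick check of this, assuming spark.default.parallelism is not set so the child inherits the parent's partition count:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
    pairs.getNumPartitions                 // 4

    val reduced = pairs.reduceByKey(_ + _) // shuffles the data...
    reduced.getNumPartitions               // ...but still 4, inherited from the parent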

What is the default partition number for shuffle in spark?

When a Spark operation performs a data shuffle (join(), union(), aggregation functions), the DataFrame's partition number automatically increases to 200. This default comes from the Spark SQL configuration spark.sql.shuffle.partitions, which is set to 200 by default.
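The setting can be read and overridden at runtime (df is a placeholder DataFrame):

    spark.conf.get("spark.sql.shuffle.partitions")       // "200" by default
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    val agg = df.groupBy("key").count() // this shuffle now produces 64 partitions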

What is the difference between Spark shuffle and Spark data frame?

A shuffle is the process of moving data from one partition to another in order to match up, aggregate, join, or spread it out in other ways. The partitions of the shuffled data frame can therefore differ in number from the partitions of the original data frame.

What are the important points to be noted about shuffle in spark?

Important points to be noted about shuffle in Spark:

1. Spark shuffle partitions have a static number of shuffle partitions.
2. Shuffle partitions do not change with the size of data.



1 Answer

Watch this space

https://issues.apache.org/jira/browse/SPARK-5997

This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
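Until then, the only built-in way to raise the partition count goes through a shuffle; a sketch of the status quo, assuming an existing rdd:

    // repartition(n) is literally coalesce(n, shuffle = true) in the RDD API,
    // so increasing partitions always pays for a full shuffle today.
    val more = rdd.coalesce(200, shuffle = true) // same as rdd.repartition(200)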

answered Oct 22 '22 by samthebest