Spark: increase number of partitions without causing a shuffle?

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).

I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0 - so what would happen is it would split a partition so that the resulting partitions' locations were all on the same node (so small net IO).
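A minimal sketch of the asymmetry described above, assuming a SparkContext sc as in the spark-shell (the input path and partition counts are hypothetical):

    val rdd = sc.textFile("hdfs:///data/input")

    // Decreasing: coalesce merges partitions locally - no shuffle, no extra stage.
    val fewer = rdd.coalesce(10)

    // Increasing: repartition is implemented as coalesce(n, shuffle = true),
    // so it always triggers a full shuffle.
    val more = rdd.repartition(100)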

This kind of functionality is automatic in Hadoop, where one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do that surely there must be a straightforward way of doing it?

Things tried:

.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND not really want I want, I want partition number to change across all types of job, not just shuffles.

asked Nov 20 '14 by samthebest


People also ask

How do I increase the number of partitions in Spark?

If you want to increase the partitions of your DataFrame, all you need to run is the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
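A short illustration, where spark, the input path, and the partition counts are placeholders:

    val df = spark.read.parquet("/data/events")   // hypothetical input
    df.rdd.getNumPartitions                       // e.g. 8

    val repartitioned = df.repartition(64)        // full shuffle, hash partitioned
    repartitioned.rdd.getNumPartitions            // 64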

How do I stop my Spark from shuffling?

One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
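In the DataFrame API this is usually expressed with the broadcast hint (largeDf and smallDf are placeholder DataFrames):

    import org.apache.spark.sql.functions.broadcast

    // The small side is shipped to every executor; the large side is never shuffled.
    val joined = largeDf.join(broadcast(smallDf), Seq("id"))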

What will avoid full shuffle in Spark if partitions are set to be decreased?

Coalesce doesn't involve a full shuffle. If the number of partitions is reduced from 5 to 2, coalesce leaves the data on 2 of the executors in place and moves the data from the remaining 3 executors onto those 2, thereby avoiding a full shuffle.
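For example, assuming a SparkContext sc:

    val rdd5 = sc.parallelize(1 to 100, 5) // 5 partitions
    val rdd2 = rdd5.coalesce(2)            // merged locally, no full shuffle
    rdd2.getNumPartitions                  // 2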

Does shuffling change the number of partitions?

Though reduceByKey() triggers a data shuffle, it doesn't change the partition count, because RDDs inherit the partition count from their parent RDD. You may still see different partition counts depending on your setup and how Spark creates partitions.
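A quick check of this, assuming spark.default.parallelism is not set so the child inherits the parent's partition count:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
    pairs.getNumPartitions                 // 4

    val reduced = pairs.reduceByKey(_ + _) // shuffles the data...
    reduced.getNumPartitions               // ...but still 4, inherited from the parent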

What is the default partition number for shuffle in spark?

When a Spark operation performs a data shuffle (join(), union(), aggregation functions), the DataFrame's partition number automatically increases to 200. This default comes from the Spark SQL configuration spark.sql.shuffle.partitions, which is set to 200 by default.
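The setting can be read and overridden at runtime (df is a placeholder DataFrame):

    spark.conf.get("spark.sql.shuffle.partitions")       // "200" by default
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    val agg = df.groupBy("key").count() // this shuffle now produces 64 partitions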

What is the difference between Spark shuffle and Spark data frame?

A shuffle is the process of moving data from one partition to another in order to match up, aggregate, join, or spread it out in other ways. The partitions of the shuffled data frame can therefore differ in number from the partitions of the original data frame.

What are the important points to be noted about shuffle in spark?

Important points to be noted about shuffle in Spark:

1. Spark shuffle partitions have a static number of shuffle partitions.
2. Shuffle partitions do not change with the size of data.



1 Answer

Watch this space

https://issues.apache.org/jira/browse/SPARK-5997

This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
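Until then, the only built-in way to raise the partition count goes through a shuffle; a sketch of the status quo, assuming an existing rdd:

    // repartition(n) is literally coalesce(n, shuffle = true) in the RDD API,
    // so increasing partitions always pays for a full shuffle today.
    val more = rdd.coalesce(200, shuffle = true) // same as rdd.repartition(200)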

answered Oct 22 '22 by samthebest