How to split parquet files into many partitions in Spark?

Tags: scala, apache-spark, parquet

I have a single Parquet file that I'm reading with Spark (using Spark SQL), and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, and we have also tried changing the Parquet compression from gzip to none. No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and from then on things are obviously much faster).

Now, according to a few sources (such as the one below), Parquet should be splittable (even when using gzip!), so I'm very confused and would love some advice.

https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html

I'm using Spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact, all the parallelism defaults are much greater than 1, so I don't understand what's going on.


asked Nov 28 '14 18:11

samthebest

2 Answers

You should write your Parquet files with a smaller block size. The default is 128 MB per block, but it is configurable by setting parquet.block.size in the writer's configuration.
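As a rough sketch of what setting the block size in the writer could look like, the snippet below rewrites a file with smaller blocks. It assumes the Spark 2.x+ SparkSession API rather than the 1.0.0 API from the question, and it assumes the writer picks up parquet.block.size from the Hadoop configuration, which may depend on the Spark version. The paths and the 8 MB value are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RewriteWithSmallerBlocks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-parquet-smaller-blocks")
      .getOrCreate()

    // Assumption: the Parquet writer reads parquet.block.size from the
    // Hadoop configuration. 8 MB is an illustrative value only.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 8 * 1024 * 1024)

    // Hypothetical input and output paths.
    val df = spark.read.parquet("/data/input.parquet")
    df.write.parquet("/data/input-small-blocks.parquet")

    spark.stop()
  }
}
```

Reading the rewritten output instead of the original file should then give Spark more blocks to spread across tasks.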

See the source of ParquetOutputFormat if you want to dig into the details.

The block size is the minimum amount of data you can read out of a Parquet file that is still logically readable (since Parquet is columnar, you can't just split by line or something similarly trivial), so you can't have more reading threads than input blocks.
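To make the "no more reading threads than input blocks" limit concrete, here is a small back-of-the-envelope sketch in plain Scala. The helper names and the 100 MB file size are hypothetical, and real block sizes only approximate the configured value, so treat the numbers as estimates.

```scala
object BlockSizing {
  /** Estimated number of blocks in a file, i.e. an upper bound on read tasks. */
  def maxReadTasks(fileSizeBytes: Long, blockSizeBytes: Long): Long = {
    require(fileSizeBytes >= 0 && blockSizeBytes > 0)
    (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes // ceiling division
  }

  /** Largest block size that still gives about `targetTasks` blocks
    * (only meaningful when the file has at least `targetTasks` bytes). */
  def blockSizeFor(fileSizeBytes: Long, targetTasks: Int): Long = {
    require(fileSizeBytes >= 0 && targetTasks > 0)
    math.max(1L, fileSizeBytes / targetTasks)
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val fileSize = 100 * mb // hypothetical 100 MB file

    // With the default 128 MB block, the whole file fits in one block.
    println(maxReadTasks(fileSize, 128 * mb)) // prints 1

    // To allow about 100 read tasks, the block size must drop to about 1 MB.
    println(blockSizeFor(fileSize, 100)) // prints 1048576
  }
}
```

This matches the symptom in the question: a file smaller than the default block size ends up as one block and therefore one partition in the first stage.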


answered Sep 25 '22 05:09

C4stor

The new way of doing it (Spark 2.x) is to set:

spark.sql.files.maxPartitionBytes

Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it omits the .sql)

In my experience, Hadoop settings no longer have any effect.
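A minimal sketch of the Spark 2.x setting, assuming a SparkSession-based job; the path and the 16 MB value are hypothetical, chosen only for illustration (the default for this setting is 128 MB).

```scala
import org.apache.spark.sql.SparkSession

object ReadWithSmallerSplits {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-parquet-smaller-splits")
      // Cap each input partition at roughly 16 MB (illustrative value).
      .config("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)
      .getOrCreate()

    // Hypothetical path to the single Parquet file.
    val df = spark.read.parquet("/data/input.parquet")

    // Check how many partitions the scan actually produced.
    println(s"input partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The resulting count can also be influenced by other settings such as spark.sql.files.openCostInBytes. As far as I understand, a single Parquet block is still read by one task, so a file written with one large block may show many partitions while only one of them holds data.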

answered Sep 22 '22 05:09

F Pereira
