How to control partition size in Spark SQL

I need to load data from a Hive table using the Spark SQL HiveContext and write it to HDFS. By default, the DataFrame produced by the SQL query has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter.

Repartitioning the RDD causes a shuffle and results in more processing time.

val result = sqlContext.sql("select * from bt_st_ent")

Has the log output of:

Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)

I would like to know whether there is any way to increase the number of partitions of the SQL output.

970

asked Jul 07 '16 15:07

nagendra

2 Answers

Spark < 2.0:

You can use Hadoop configuration options:

mapred.min.split.size
mapred.max.split.size

as well as HDFS block size to control partition size for filesystem based formats*.

val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
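To see how these two settings interact with the block size, here is a minimal sketch of the split-size rule used by Hadoop's new-API `FileInputFormat`, which is `max(minSize, min(maxSize, blockSize))`. The object and helper names are hypothetical, the file and block sizes are made-up inputs, and the split count is a plain ceiling division that ignores Hadoop's slop factor for the last split. Other input formats (including those built on the old `mapred` API) may compute splits differently.

```scala
object SplitSizeSketch {
  private val MiB: Long = 1024L * 1024L

  // Split-size rule of Hadoop's new-API FileInputFormat.
  def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))

  // Approximate number of splits for one splittable file (ceiling division).
  def approxSplits(fileSize: Long, splitSize: Long): Long =
    (fileSize + splitSize - 1) / splitSize

  def main(args: Array[String]): Unit = {
    val fileSize  = 1024 * MiB // assumed 1 GiB input file
    val blockSize = 128 * MiB  // assumed HDFS block size

    val defaults = computeSplitSize(blockSize, 1L, Long.MaxValue)
    val capped   = computeSplitSize(blockSize, 1L, 32 * MiB)           // lower max => more splits
    val raised   = computeSplitSize(blockSize, 256 * MiB, Long.MaxValue) // higher min => fewer splits

    println(approxSplits(fileSize, defaults)) // 8
    println(approxSplits(fileSize, capped))   // 32
    println(approxSplits(fileSize, raised))   // 4
  }
}
```

In this model, lowering the maximum split size below the block size is what raises the number of input partitions, while raising the minimum above the block size reduces it.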

Spark 2.0+:

You can use spark.sql.files.maxPartitionBytes configuration:

spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
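As a rough illustration of how this value translates into a partition count, the sketch below models the split size Spark's built-in file sources derive from `spark.sql.files.maxPartitionBytes`, `spark.sql.files.openCostInBytes` and the default parallelism. It is a simplified model for a single splittable file, not Spark's actual code: the object and method names are hypothetical, the 1 GiB file and parallelism of 8 are assumed inputs, and the exact behaviour may differ between Spark versions and formats.

```scala
object MaxPartitionBytesSketch {
  private val MiB: Long = 1024L * 1024L

  // Simplified model: min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)).
  def maxSplitBytes(fileSize: Long,
                    maxPartitionBytes: Long,
                    openCostInBytes: Long,
                    defaultParallelism: Int): Long = {
    val bytesPerCore = (fileSize + openCostInBytes) / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Approximate partition count for one splittable file (ceiling division).
  def approxPartitions(fileSize: Long, splitBytes: Long): Long =
    (fileSize + splitBytes - 1) / splitBytes

  def main(args: Array[String]): Unit = {
    val fileSize = 1024 * MiB // assumed 1 GiB input file
    val openCost = 4 * MiB    // default spark.sql.files.openCostInBytes
    val cores    = 8          // assumed default parallelism

    val withDefault = maxSplitBytes(fileSize, 128 * MiB, openCost, cores)
    val withSmaller = maxSplitBytes(fileSize, 32 * MiB, openCost, cores)

    println(approxPartitions(fileSize, withDefault)) // 8
    println(approxPartitions(fileSize, withSmaller)) // 32
  }
}
```

Under these assumptions, reducing `maxPartitionBytes` from 128 MiB to 32 MiB quadruples the number of read partitions without any shuffle.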

In both cases these values may not be used by a specific data source API, so you should always check the documentation / implementation details of the format you use.

* Other input formats can use different settings. See for example

Partitioning in spark while reading from RDBMS via JDBC
Difference between mapreduce split and spark paritition

Furthermore, Datasets created from RDDs will inherit the partition layout from their parents.

Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between bucket and Dataset partition.

121

answered Oct 16 '22 13:10

zero323

A very common and painful problem. You should look for a key which distributes the data into uniform partitions. Then you can use the DISTRIBUTE BY and CLUSTER BY operators to tell Spark to group rows into partitions by that key. This incurs some overhead on the query itself, since it adds a shuffle, but results in evenly sized partitions. Deepsense has a very good tutorial on this.
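A minimal sketch of the DISTRIBUTE BY approach, assuming Spark 2.0+ on the classpath and a local session. The column `ent_id` is hypothetical, and the Hive table from the question is replaced here by a temporary view of synthetic rows so the example is self-contained. The number of output partitions follows `spark.sql.shuffle.partitions`; adaptive execution is switched off in the sketch because it may otherwise coalesce shuffle partitions.

```scala
import org.apache.spark.sql.SparkSession

object DistributeBySketch {
  // Returns the partition count of a query that redistributes rows by a key.
  def partitionsAfterDistributeBy(spark: SparkSession, shufflePartitions: Int): Int = {
    // DISTRIBUTE BY produces this many partitions.
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)
    // Keep adaptive execution from merging small shuffle partitions.
    spark.conf.set("spark.sql.adaptive.enabled", "false")

    // Stand-in for the real table; ent_id is a hypothetical, evenly spread key.
    spark.range(0, 10000)
      .selectExpr("id", "id % 100 AS ent_id")
      .createOrReplaceTempView("bt_st_ent")

    val result = spark.sql("SELECT * FROM bt_st_ent DISTRIBUTE BY ent_id")
    result.rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("distribute-by-sketch")
      .getOrCreate()
    println(partitionsAfterDistributeBy(spark, 8))
    spark.stop()
  }
}
```

Rows with the same key value always land in the same partition, so how even the partitions are depends on how uniformly the chosen key is distributed.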

answered Oct 16 '22 11:10

Fokko Driesprong
