This question is the same as Number of partitions of a spark dataframe created by reading the data from Hive table, but I think that question did not get a correct answer.
Note that the question asks how many partitions will be created when the DataFrame is created as a result of executing a SQL query against a Hive table using the SparkSession.sql method.
IIUC, the question above is distinct from asking how many partitions will be created when the DataFrame is created as a result of executing some code like spark.read.json("examples/src/main/resources/people.json"),
which loads the data directly from the filesystem (which could be HDFS). I think the answer to this latter question is given by spark.sql.files.maxPartitionBytes:
spark.sql.files.maxPartitionBytes (default: 134217728, i.e. 128 MB): The maximum number of bytes to pack into a single partition when reading files.
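For concreteness, a minimal sketch of that file-based path (the JSON path is the one from the Spark examples; spark is the spark-shell session, and the exact printed default may vary by Spark version):

// The config that governs partitioning for direct file reads (default 128 MB).
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

// Partition count is then roughly total input bytes / maxPartitionBytes,
// subject to file-count and parallelism heuristics.
val jsonDf = spark.read.json("examples/src/main/resources/people.json")
println(jsonDf.rdd.getNumPartitions)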
Experimentally, I have tried creating a DataFrame from a Hive table, and the number of partitions I get is not explained by total data in the Hive table / spark.sql.files.maxPartitionBytes.
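For reference, a sketch of the experiment (my_db.my_table is a placeholder Hive table; spark is the spark-shell session):

// Placeholder Hive table; the DataFrame is created via SparkSession.sql.
val df = spark.sql("SELECT * FROM my_db.my_table")

// Number of partitions of the resulting DataFrame.
println(df.rdd.getNumPartitions)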
Also, adding to the OP, it would be good to know how the number of partitions can be controlled, i.e. when one wants to force Spark to use a different number than it would use by default.
References:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
In Spark, one should carefully choose the number of partitions depending on the cluster design and application requirements. A common rule of thumb for the number of partitions in an RDD is a small multiple (typically two to four) of the number of cores in the cluster.
repartition() can be used for increasing or decreasing the number of partitions of a Spark DataFrame. However, repartition() involves shuffling which is a costly operation.
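For example, a minimal sketch (the table name and the target of 10 partitions are arbitrary placeholders):

// Placeholder table; spark is the spark-shell SparkSession.
val df = spark.table("my_db.my_table")

// Explicitly set the number of partitions; repartition() triggers a full shuffle.
val repartitioned = df.repartition(10)
println(repartitioned.rdd.getNumPartitions)  // 10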
Spark reads the data in blocks of 128 MB each. For example, if your Hive table is approximately 14.8 GB, the data is divided into 128 MB blocks, resulting in 119 partitions. If, on the other hand, your Hive table is partitioned and the partition column has 150 unique values, the number of partitions reflects that layout instead.
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
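As a sketch of that behaviour with the RDD API (the HDFS path and the value 400 are hypothetical):

// Default: roughly one partition per 128 MB HDFS block.
val rdd = spark.sparkContext.textFile("hdfs:///data/people.txt")
println(rdd.getNumPartitions)

// Ask for at least 400 partitions, i.e. a larger value than the block-based default.
val rddMore = spark.sparkContext.textFile("hdfs:///data/people.txt", minPartitions = 400)
println(rddMore.getNumPartitions)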
TL;DR: The default number of partitions when reading data from Hive is governed by the HDFS blockSize. The number of partitions can be increased by setting mapreduce.job.maps to an appropriate value, and decreased by setting mapreduce.input.fileinputformat.split.minsize to an appropriate value.
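A hedged sketch of passing those Hadoop settings from a Spark application (the concrete values are arbitrary examples); the details behind these keys follow below:

// The Hadoop configuration used when Spark plans the Hive table scan.
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Raising the minimum split size decreases the number of partitions.
hadoopConf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)

// Raising the hinted number of map tasks can increase the number of partitions (old API).
hadoopConf.setInt("mapreduce.job.maps", 400)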
Spark SQL creates an instance of HadoopRDD when loading data from a Hive table.
An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).
HadoopRDD in turn splits input files according to the computeSplitSize method defined in org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the new API) and org.apache.hadoop.mapred.FileInputFormat (the old API).
New API:
protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
Old API:
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
computeSplitSize splits files according to the HDFS blockSize, but if the blockSize is less than minSize or greater than maxSize it is clamped to those extremes. The HDFS blockSize can be obtained with:
hdfs getconf -confKey dfs.blocksize
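To make the clamping concrete, here is a small Scala sketch of the same formula with illustrative values (the minSize and maxSize values are hypothetical):

// Mirrors FileInputFormat.computeSplitSize (new API): max(minSize, min(maxSize, blockSize)).
def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
  math.max(minSize, math.min(maxSize, blockSize))

val blockSize = 128L * 1024 * 1024                                       // from dfs.blocksize
println(computeSplitSize(blockSize, 1L, Long.MaxValue))                  // 134217728: block size wins
println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MaxValue))  // 268435456: clamped up to minSize
println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024))              // 67108864: clamped down to maxSize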
According to Hadoop: The Definitive Guide (Table 8.5), minSize is obtained from mapreduce.input.fileinputformat.split.minsize and maxSize is obtained from mapreduce.input.fileinputformat.split.maxsize.
However, the book also mentions the following regarding mapreduce.input.fileinputformat.split.maxsize:
This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf).
This post also calculates the maxSize as the total input size divided by the number of map tasks.
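As a hedged worked example for the old API, using the 14.8 GB figure mentioned earlier (goalSize is the total input divided by the requested number of map tasks; the values of 2 for an untuned job and 400 for a tuned one are assumptions for illustration):

// Old-API sketch: goalSize = totalInputSize / numMapTasks, then clamped against blockSize/minSize.
val totalInputBytes = (14.8 * 1024 * 1024 * 1024).toLong   // ~14.8 GB of table data
val blockSize       = 128L * 1024 * 1024                   // dfs.blocksize
val minSize         = 1L                                   // default split.minsize

// With only 2 map tasks requested, goalSize is huge, so the block size wins:
val splitDefault = math.max(minSize, math.min(totalInputBytes / 2, blockSize))
println(totalInputBytes / splitDefault)                    // prints 118; with per-file remainders, roughly 119 splits

// Raising mapreduce.job.maps to 400 pushes goalSize below the block size:
val splitMore = math.max(minSize, math.min(totalInputBytes / 400, blockSize))
println(totalInputBytes / splitMore)                       // roughly 400 splits/partitions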