Hive bucketing through Spark SQL

I have a question regarding bucketing in Hive. I have created a temporary table which is bucketed on a key column.

Through Spark SQL I am inserting data into this temporary table. I have set hive.enforce.bucketing to true in the Spark session.
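For reference, the flow is roughly the following (a minimal sketch; the table and column names are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// ask Hive to enforce bucketing on insert
spark.sql("SET hive.enforce.bucketing=true")

// temporary staging table, bucketed on the key column (names are illustrative)
spark.sql("CREATE TABLE tmp_bucketed (key INT, value STRING) CLUSTERED BY (key) INTO 4 BUCKETS STORED AS ORC")

// insert through Spark SQL from another (hypothetical) table
spark.sql("INSERT INTO tmp_bucketed SELECT key, value FROM source_table")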

When I check the base directory for this table, the file names are prefixed with part_*.

However, when I manually insert data into this table from another table, I see files prefixed with 00000_*.

I am not sure whether Spark SQL is actually writing the data in buckets.

Can someone please help?

Thanks,

Sumit D asked Aug 02 '18


People also ask

Can we use bucketing in dynamic partition?

Yes. By setting the hive.exec.dynamic.partition=true property, we can enable dynamic partitioning while loading data into a Hive table, and bucketing can be used together with it.
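From a Hive-enabled Spark session, that typically looks something like this (a sketch; the table names are hypothetical):

// enable dynamic partitioning
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// insert into a table that is partitioned (and possibly also bucketed);
// the dt partition values are resolved dynamically from the data
spark.sql("INSERT INTO sales PARTITION (dt) SELECT amount, dt FROM staging_sales")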

Can we do bucketing without partitioning in Hive?

Yes, bucketing can also be done even without partitioning on Hive tables. Bucketed tables allow much more efficient sampling than non-bucketed tables.
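For example, a table can be clustered into buckets with no PARTITIONED BY clause at all, and then sampled one bucket at a time (a sketch; the table and column names are illustrative):

// bucketed-only table: CLUSTERED BY without any PARTITIONED BY clause
spark.sql("CREATE TABLE users (id INT, name STRING) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC")

// in Hive, such a table can then be sampled efficiently by bucket:
//   SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 8 ON id);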

What is difference between Spark bucketing and Hive bucketing?

In Hive, there is one reducer per bucket, which determines the number of files created. In Spark bucketing there is no reducer stage, so Spark can end up creating N files per bucket, based on the number of tasks; see the sketch below.
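The Spark-native side of that comparison looks roughly like this (a sketch; the table name is illustrative). Because there is no reducer stage, each write task can emit its own file per bucket:

import org.apache.spark.sql.functions.col

val df = spark.range(1000).withColumn("key", col("id") % 10)

// Spark-native bucketing; bucketBy only works with saveAsTable,
// and a write running with many tasks can produce many files per bucket
df.write
  .bucketBy(8, "key")
  .sortBy("key")
  .format("parquet")
  .saveAsTable("spark_bucketed")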

Does Spark SQL support bucketing?

Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating the shuffle in join or group-by-aggregate scenarios. This makes it ideal for a variety of write-once, read-many datasets at ByteDance.
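For example, a join between two tables bucketed the same way on the join key can skip the shuffle (a sketch, assuming both hypothetical tables were previously written with matching bucketBy specs):

// both sides bucketed into the same number of buckets on "key"
val orders = spark.table("orders_bucketed")
val users = spark.table("users_bucketed")

// with matching bucket specs, the physical plan should show no Exchange
// (i.e. no shuffle) feeding the join
orders.join(users, "key").explain()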


2 Answers

A confusing area.

I found this some time ago:

However, Hive bucketed tables are supported from Spark 2.3 onwards. Spark normally disallows users from writing output to Hive bucketed tables. Setting hive.enforce.bucketing=false and hive.enforce.sorting=false will allow you to save to Hive bucketed tables.
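In practice that means something like this in a Hive-enabled Spark session (a sketch; the table names are placeholders):

// tell Hive not to enforce bucketing/sorting guarantees on write
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

// the insert is now allowed, but the files produced are NOT guaranteed
// to be laid out according to the table's bucket spec
spark.sql("INSERT INTO bucketed_table SELECT * FROM some_view")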

In Spark's JIRA: https://issues.apache.org/jira/browse/SPARK-17729

Hive allows inserting data into a bucketed table without guaranteeing bucketed- and sorted-ness, based on these two configs: hive.enforce.bucketing and hive.enforce.sorting.

With this JIRA, Spark still won't produce bucketed data as per Hive's bucketing guarantees, but will allow writes IFF the user wishes to do so without caring about bucketing guarantees. The ability to create bucketed tables will enable adding test cases to Spark while pieces are being added to Spark to have it support Hive bucketing (e.g. https://github.com/apache/spark/pull/15229)

But the definitive source, https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html#unsupported-hive-functionality, says the following:

Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.

Major Hive Features

Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet.

So to answer your question: you are getting Spark's approach to Hive bucketing, which is an approximation and thus not really the same thing.
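One way to see which layout you actually got is to compare the table's declared bucket spec with the files on disk (a sketch; the table name is a placeholder):

// declared spec: look for "Num Buckets" and "Bucket Columns" in the output
spark.sql("DESCRIBE FORMATTED my_bucketed_table").show(100, truncate = false)

// then list the table's directory on HDFS and compare the file count and
// naming (part_* vs 00000_*) against the declared number of buckets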

thebluephantom answered Sep 22 '22


While Spark (in versions <= 2.4, at least) doesn't directly support Hive's bucketing format, it is possible to get Spark to output bucketed data that is readable by Hive by using Spark SQL to load the data into a Hive table:

// enable Hive support when creating/configuring the Spark session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._ // needed for the toDF conversion below

// register a DataFrame as a view that can be queried with Spark SQL
val testDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("number", "letter")
testDF.createOrReplaceTempView("testDF")

// create the Hive table; this can also be done manually, e.g. via the Hive CLI
// (note: CLUSTERED BY takes its column list in parentheses)
val createTableSQL = "CREATE TABLE testTable (number int, letter string) " +
  "CLUSTERED BY (number) INTO 1 BUCKETS STORED AS PARQUET"
spark.sql(createTableSQL)

// load data from the DataFrame into Hive; the output parquet files will be bucketed and readable by Hive
spark.sql("INSERT INTO testTable SELECT * FROM testDF")
jmng answered Sep 23 '22