Hive bucketing through Spark SQL

I have a question regarding bucketing in Hive. I have created a temporary table which is bucketed on a key column.

Through Spark SQL I am inserting data into this temporary table. I have set hive.enforce.bucketing to true in the Spark session.
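For reference, the flow is roughly the following (a minimal sketch; the table and column names are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// ask Hive to enforce bucketing on insert
spark.sql("SET hive.enforce.bucketing=true")

// temporary staging table, bucketed on the key column (names are illustrative)
spark.sql("CREATE TABLE tmp_bucketed (key INT, value STRING) CLUSTERED BY (key) INTO 4 BUCKETS STORED AS ORC")

// insert through Spark SQL from another (hypothetical) table
spark.sql("INSERT INTO tmp_bucketed SELECT key, value FROM source_table")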

When I check the base directory for this table, the file names are prefixed with part_*.

However, when I manually insert data into this table from another table, I see files prefixed with 00000_*.

I am not sure whether Spark SQL is actually writing the data in buckets.

Can someone please help?

Thanks,

Sumit D asked Aug 02 '18


People also ask

Can we use bucketing in dynamic partition?

Yes. By setting the hive.exec.dynamic.partition=true property, we can enable dynamic partitioning while loading data into a Hive table, and bucketing can be used together with it.
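From a Hive-enabled Spark session, that typically looks something like this (a sketch; the table names are hypothetical):

// enable dynamic partitioning
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// insert into a table that is partitioned (and possibly also bucketed);
// the dt partition values are resolved dynamically from the data
spark.sql("INSERT INTO sales PARTITION (dt) SELECT amount, dt FROM staging_sales")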

Can we do bucketing without partitioning in Hive?

Yes, bucketing can also be done even without partitioning on Hive tables. Bucketed tables allow much more efficient sampling than non-bucketed tables.
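For example, a table can be clustered into buckets with no PARTITIONED BY clause at all, and then sampled one bucket at a time (a sketch; the table and column names are illustrative):

// bucketed-only table: CLUSTERED BY without any PARTITIONED BY clause
spark.sql("CREATE TABLE users (id INT, name STRING) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC")

// in Hive, such a table can then be sampled efficiently by bucket:
//   SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 8 ON id);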

What is difference between Spark bucketing and Hive bucketing?

In Hive, there is one reducer per bucket, which determines the number of files created. In Spark bucketing there is no reducer stage, so Spark can end up creating N files per bucket, based on the number of tasks; see the sketch below.
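The Spark-native side of that comparison looks roughly like this (a sketch; the table name is illustrative). Because there is no reducer stage, each write task can emit its own file per bucket:

import org.apache.spark.sql.functions.col

val df = spark.range(1000).withColumn("key", col("id") % 10)

// Spark-native bucketing; bucketBy only works with saveAsTable,
// and a write running with many tasks can produce many files per bucket
df.write
  .bucketBy(8, "key")
  .sortBy("key")
  .format("parquet")
  .saveAsTable("spark_bucketed")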

Does Spark SQL support bucketing?

Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating the shuffle in join or group-by-aggregate scenarios. This makes it ideal for a variety of write-once, read-many datasets at ByteDance.
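For example, a join between two tables bucketed the same way on the join key can skip the shuffle (a sketch, assuming both hypothetical tables were previously written with matching bucketBy specs):

// both sides bucketed into the same number of buckets on "key"
val orders = spark.table("orders_bucketed")
val users = spark.table("users_bucketed")

// with matching bucket specs, the physical plan should show no Exchange
// (i.e. no shuffle) feeding the join
orders.join(users, "key").explain()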


2 Answers

A confusing area.

I found this some time ago:

However, Hive bucketed tables are supported from Spark 2.3 onwards. Spark normally disallows users from writing output to Hive bucketed tables. Setting hive.enforce.bucketing=false and hive.enforce.sorting=false will allow you to save to Hive bucketed tables.
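In practice that means something like this in a Hive-enabled Spark session (a sketch; the table names are placeholders):

// tell Hive not to enforce bucketing/sorting guarantees on write
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

// the insert is now allowed, but the files produced are NOT guaranteed
// to be laid out according to the table's bucket spec
spark.sql("INSERT INTO bucketed_table SELECT * FROM some_view")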

In Spark's JIRA: https://issues.apache.org/jira/browse/SPARK-17729

Hive allows inserting data into a bucketed table without guaranteeing bucketed- and sorted-ness, based on these two configs: hive.enforce.bucketing and hive.enforce.sorting.

With this JIRA, Spark still won't produce bucketed data as per Hive's bucketing guarantees, but will allow writes IFF the user wishes to do so without caring about bucketing guarantees. The ability to create bucketed tables will enable adding test cases to Spark while pieces are being added to Spark to have it support Hive bucketing (e.g. https://github.com/apache/spark/pull/15229)

But the definitive source, https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html#unsupported-hive-functionality, says the following:

Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.

Major Hive Features

Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet.

So to answer your question: you are getting Spark's approach to Hive bucketing, which is an approximation and thus not really the same thing.
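One way to see which layout you actually got is to compare the table's declared bucket spec with the files on disk (a sketch; the table name is a placeholder):

// declared spec: look for "Num Buckets" and "Bucket Columns" in the output
spark.sql("DESCRIBE FORMATTED my_bucketed_table").show(100, truncate = false)

// then list the table's directory on HDFS and compare the file count and
// naming (part_* vs 00000_*) against the declared number of buckets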

thebluephantom answered Sep 22 '22


While Spark (in versions <= 2.4, at least) doesn't directly support Hive's bucketing format, it is possible to get Spark to output bucketed data that is readable by Hive by using Spark SQL to load the data into a Hive table:

// enable Hive support when creating/configuring the Spark session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._ // needed for the toDF conversion below

// register a DataFrame as a view that can be queried with Spark SQL
val testDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("number", "letter")
testDF.createOrReplaceTempView("testDF")

// create the Hive table; this can also be done manually, e.g. via the Hive CLI
// (note: CLUSTERED BY takes its column list in parentheses)
val createTableSQL = "CREATE TABLE testTable (number int, letter string) " +
  "CLUSTERED BY (number) INTO 1 BUCKETS STORED AS PARQUET"
spark.sql(createTableSQL)

// load data from the DataFrame into Hive; the output parquet files will be bucketed and readable by Hive
spark.sql("INSERT INTO testTable SELECT * FROM testDF")
jmng answered Sep 23 '22