 

What is the difference between partitioning and bucketing in Spark?

I'm trying to optimize a join between two Spark DataFrames, call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it to the nodes of the Spark cluster. df2 is very large (200M rows), so I tried bucketing/repartitioning it by "SaleId".

In Spark, what is the difference between partitioning the data by column and bucketing the data by column?

for example:

partition:

df2 = df2.repartition(10, "SaleId")

bucket:

df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

After each one of those techniques I just joined df2 with df1.

I can't figure out which of these is the right technique to use. Thank you.

nofar mishraki asked Jul 02 '19

People also ask

What is the difference between bucketing and partitioning?

Bucketing decomposes data into more manageable, roughly equal parts. With partitioning, you can end up with many small partitions, one per distinct column value. With bucketing, you restrict the data to a fixed number of buckets.
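The contrast above can be sketched in plain Python, with no Spark required (the rows and column name here are made up for illustration): partitioning by value creates one group per distinct value, while bucketing hashes every key into a fixed number of buckets.

```python
import zlib
from collections import defaultdict

rows = [
    {"SaleId": "s1", "amount": 10},
    {"SaleId": "s2", "amount": 20},
    {"SaleId": "s3", "amount": 30},
    {"SaleId": "s1", "amount": 40},
]

# Partitioning by value: one partition per distinct key, so the
# partition count grows with the column's cardinality.
partitions = defaultdict(list)
for row in rows:
    partitions[row["SaleId"]].append(row)

# Bucketing: a deterministic hash maps each key into a fixed number
# of buckets, regardless of how many distinct keys exist.
NUM_BUCKETS = 2
buckets = defaultdict(list)
for row in rows:
    bucket_id = zlib.crc32(row["SaleId"].encode()) % NUM_BUCKETS
    buckets[bucket_id].append(row)

print(len(partitions))  # one partition per distinct SaleId
print(len(buckets))     # can never exceed NUM_BUCKETS
```

Note that both rows with SaleId "s1" are guaranteed to land in the same bucket, which is what makes bucketing useful for joins.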

What is bucketing in Spark?

Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. When applied properly, bucketing can lead to join optimizations by avoiding shuffles (aka exchanges) of the tables participating in the join.

What is the difference between Hive partition and Spark partition?

Both are chunks of data, but Spark splits data in order to process it in parallel in memory, whereas a Hive partition lives in storage, on disk, as persisted data.

Can we create partitioning and bucketing on same column in Hive?

You can! In that case, you will have buckets inside the partitioned data.


1 Answer

repartition is applied at runtime, as part of an action within the same Spark job.

bucketBy is for output, i.e. on write. It persists the bucketing so that the next Spark application can avoid shuffling, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html which is an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.
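To illustrate why matching bucketing lets a join skip the shuffle, here is a plain-Python sketch (the data and the crc32 hash are stand-ins for illustration, not Spark's internals): when both sides are bucketed by the join key with the same hash and the same bucket count, bucket i on one side only ever needs to be joined against bucket i on the other, so no rows have to move between buckets.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Stand-in for Spark's hash partitioning; any deterministic hash works.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

def bucketize(rows, key):
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key])].append(row)
    return buckets

# Small dimension-like table and a larger fact-like table (made-up data).
df1 = [{"SaleId": f"s{i}", "price": i * 10} for i in range(5)]
df2 = [{"SaleId": f"s{i % 5}", "qty": i} for i in range(20)]

b1, b2 = bucketize(df1, "SaleId"), bucketize(df2, "SaleId")

# Bucket-wise join: because both sides used the same hash and the same
# bucket count, matching keys are guaranteed to sit in the same bucket,
# so there is no cross-bucket data movement (i.e. no shuffle).
joined = []
for i in range(NUM_BUCKETS):
    for left in b1.get(i, []):
        for right in b2.get(i, []):
            if left["SaleId"] == right["SaleId"]:
                joined.append({**left, **right})

print(len(joined))  # every df2 row matches exactly one df1 row -> 20
```

In real Spark the same effect shows up as the absence of an Exchange node in the join's physical plan when both tables are bucketed compatibly on the join key.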

thebluephantom answered Oct 05 '22