I'm trying to optimize a join between two Spark DataFrames, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it to the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".
In Spark, what is the difference between partitioning the data by column and bucketing the data by column?
for example:
partition:
df2 = df2.repartition(10, "SaleId")
bucket:
df2.write.format('parquet').bucketBy(10, 'SaleId').mode('overwrite').saveAsTable('bucketed_table')
After each one of those techniques I just joined df2 with df1.
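For reference, the join looks roughly like this (the explicit inner join type is just for illustration):

from pyspark.sql import functions as F

# df1 is small, so it is broadcast; df2 is the large probe side
joined = df2.join(F.broadcast(df1), on="SaleId", how="inner")
joined.explain()  # the plan should show a BroadcastHashJoin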
I can't figure out which of those is the right technique to use. Thank you
Bucketing decomposes data into more manageable, roughly equal parts. With partitioning, there is a risk of creating many small partitions, one per distinct column value. With bucketing, you restrict the data to a fixed number of buckets.
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. When applied properly, bucketing can lead to join optimizations by avoiding shuffles (aka exchanges) of the tables participating in the join.
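For example, a minimal sketch of a shuffle-free join with bucketing, assuming both sides are saved as bucketed tables on SaleId with the same number of buckets (the table names here are illustrative):

df1.write.format('parquet').bucketBy(10, 'SaleId').mode('overwrite').saveAsTable('df1_bucketed')
df2.write.format('parquet').bucketBy(10, 'SaleId').mode('overwrite').saveAsTable('df2_bucketed')

t1 = spark.table('df1_bucketed')
t2 = spark.table('df2_bucketed')

# with matching bucketing on both sides, the physical plan should contain
# no Exchange (shuffle) step before the join
t1.join(t2, 'SaleId').explain()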
Both are chunks of data, but Spark partitions exist so the data can be processed in parallel in memory, whereas a Hive-style partition lives in storage, on disk, as a persisted layout.
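To make the distinction concrete, a small sketch (the output path is an assumption):

# in-memory partitions are what Spark parallelizes over in the current job
print(df2.rdd.getNumPartitions())

# a Hive-style partition is a directory layout on disk, created at write time;
# it produces one sub-directory per distinct SaleId value (e.g. SaleId=123/)
df2.write.mode('overwrite').partitionBy('SaleId').parquet('/tmp/df2_by_saleid')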
You can combine the two! In that case, you will have buckets inside the partitioned data (see the sketch below).
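A sketch of combining both in a single write (the Year column and table name are hypothetical):

(df2.write
    .format('parquet')
    .partitionBy('Year')          # disk partitions: one directory per year
    .bucketBy(10, 'SaleId')       # buckets by the join key inside each partition
    .sortBy('SaleId')
    .mode('overwrite')
    .saveAsTable('partitioned_bucketed_table'))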
repartition is meant to be used as part of an action within the same Spark job.
bucketBy is for output, i.e. writing, and thus for avoiding shuffling in the next Spark app, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html for an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.
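A short sketch of the contrast, reusing the bucketed_table written in the question:

# repartition: an in-memory layout for this job only; the shuffle still runs
# as part of this job and nothing is persisted once the application ends
df2.repartition(10, 'SaleId').join(df1, 'SaleId').count()

# bucketBy: the layout is stored with the table, so a later Spark application
# can read it back and join without re-shuffling the bucketed side
spark.table('bucketed_table').join(df1, 'SaleId').explain()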