 

Spark DataFrame Repartition and Parquet Partition

  1. I am using repartition on columns to store the data in Parquet, but I see that the number of partitioned Parquet files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?

  2. When I write the data to a Parquet partition using RDD repartition and then read the data back from that Parquet partition, is there any condition under which the RDD partition count will be the same for the read and the write?

  3. How is bucketing a DataFrame using a column id different from repartitioning a DataFrame via the same column id?

  4. When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?

asked Sep 26 '18 by Ayan Biswas


People also ask

What is the difference between partition and repartition in Spark?

repartition() creates a specified number of partitions in memory, while partitionBy() writes files to disk for each memory partition and partition column.

Can Parquet file be partitioned?

An ORC or Parquet file contains data columns. You can add partition columns to these files at write time. The data files do not store values for the partition columns; instead, when writing the files, you divide them into groups (partitions) based on column values.

How do I repartition a Spark data frame?

If you want to increase the number of partitions of your DataFrame, all you need to run is the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.

Why coalesce is better than repartition in Spark?

The repartition algorithm does a full shuffle of the data and creates equally sized partitions; coalesce combines existing partitions to avoid a full shuffle.
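
A minimal Scala sketch of the difference (the row count and partition numbers are arbitrary):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

    val df = spark.range(0, 1000000L)       // Dataset with the default number of partitions

    val shuffled = df.repartition(8)        // full shuffle into 8 roughly equal partitions
    val merged   = df.coalesce(2)           // merges existing partitions, no shuffle

    println(shuffled.rdd.getNumPartitions)  // 8
    println(merged.rdd.getNumPartitions)    // 2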


1 Answer

You're asking about a couple of things here: partitioning, bucketing, and balancing of data.

Partitioning:

  1. Partitioning data is often used to distribute load horizontally; this has a performance benefit and helps organize the data in a logical fashion.
  2. Partitioning tables changes how persisted data is structured: Spark will create subdirectories reflecting the partitioning structure.
  3. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.

In Spark, this is done by df.write.partitionBy(column*), which groups the data by the partitioning columns into the same subdirectory.
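
A minimal sketch of what this looks like (the output path and column names are made up for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("partitionBy-demo").getOrCreate()
    import spark.implicits._

    val events = Seq(("US", 2018, 1), ("DE", 2018, 2), ("US", 2019, 3))
      .toDF("country", "year", "value")

    events.write
      .mode("overwrite")
      .partitionBy("country", "year")
      .parquet("/tmp/events_partitioned")

    // Resulting layout: one subdirectory per distinct value combination, e.g.
    //   /tmp/events_partitioned/country=US/year=2018/part-*.parquet
    //   /tmp/events_partitioned/country=US/year=2019/part-*.parquet
    //   /tmp/events_partitioned/country=DE/year=2018/part-*.parquet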

Bucketing:

  1. Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data set is hashed into a user-defined number of buckets (files).
  2. Synonymous with Hive's Distribute By.

In Spark, this is done by df.write.bucketBy(n, column*), which groups the data by the bucketing columns into the same file; the number of files generated is controlled by n.
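
A minimal sketch (the table and column names are made up; note that bucketBy requires saveAsTable, since the bucket metadata has to live in the catalog):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bucketBy-demo").getOrCreate()
    import spark.implicits._

    val events = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "payload")

    // bucketBy only works together with saveAsTable; writing straight to a
    // path with .parquet(...) would fail because there is nowhere to record
    // the bucket metadata.
    events.write
      .bucketBy(16, "id")   // hash "id" into 16 buckets
      .sortBy("id")         // optional: keep each bucket sorted
      .saveAsTable("events_bucketed")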

Repartition:

  1. It returns a new DataFrame balanced evenly across the given number of internal partitions, based on the given partitioning expressions. The resulting DataFrame is hash partitioned.
  2. Spark manages the data in these partitions in a way that parallelizes distributed data processing with minimal network traffic for sending data between executors.

In Spark, this is done by df.repartition(n, column*), which groups the data by the partitioning columns into the same internal partition. Note that no data is persisted to storage; this is just internal balancing of the data, based on constraints similar to bucketBy.
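
A minimal sketch (the column names and partition count are arbitrary):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (1, "c"), (3, "d")).toDF("id", "payload")

    // Hash-partition in memory on "id": rows with the same id land in the
    // same partition. Nothing is written to disk here.
    val byId = df.repartition(8, col("id"))
    println(byId.rdd.getNumPartitions)  // 8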

TL;DR

1) I am using repartition on columns to store the data in Parquet, but I see that the number of partitioned Parquet files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?

  • repartition correlates with bucketBy, not partitionBy. The number of partitioned files is governed by other configs such as spark.sql.shuffle.partitions and spark.default.parallelism (see the sketch below).
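
A sketch of how spark.sql.shuffle.partitions drives the partition (and hence part-file) count; the data, path, and config values here are made up, and broadcast joins and adaptive execution are disabled so the shuffle actually happens:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.shuffle.partitions", "4")          // default is 200
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // force a shuffle join
    spark.conf.set("spark.sql.adaptive.enabled", "false")        // keep the count predictable

    val dfA = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val dfB = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "r")

    val joined = dfA.join(dfB, "id")      // the join shuffles into 4 partitions
    println(joined.rdd.getNumPartitions)  // 4

    // The in-memory partition count at write time bounds the part-file count:
    joined.write.mode("overwrite").parquet("/tmp/joined")  // at most 4 part-files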

2) When I write the data to a Parquet partition using RDD repartition and then read the data back from that Parquet partition, is there any condition under which the RDD partition count will be the same for the read and the write?

  • at read time, the number of partitions will be equal to spark.default.parallelism (a quick way to check this is sketched below)
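
A quick way to inspect this on your own data (the path is the hypothetical one from the partitionBy sketch above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("read-partitions-demo").getOrCreate()

    // On read, Spark derives the splits from file sizes and settings such as
    // spark.sql.files.maxPartitionBytes and the default parallelism, so the
    // count will generally differ from whatever repartition() produced on write.
    val readBack = spark.read.parquet("/tmp/events_partitioned")
    println(readBack.rdd.getNumPartitions)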

3) How is bucketing a DataFrame using a column id different from repartitioning a DataFrame via the same column id?

  • They work similarly, except that bucketing is a write operation and is used for persistence.

4) When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?

  • repartition of both datasets happens in memory; if one or both of the datasets are persisted, look into bucketBy as well (see the sketch below).
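
A sketch of the payoff for joins, assuming both tables were previously written with the same bucketBy(16, "id"); the table names are made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bucketed-join-demo").getOrCreate()

    val a = spark.table("events_bucketed_a")
    val b = spark.table("events_bucketed_b")

    // With matching bucketing on the join key, the physical plan can use a
    // SortMergeJoin without an Exchange (shuffle) on either side.
    a.join(b, "id").explain()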
answered Sep 20 '22 by Chitral Verma