 

Spark Parquet Partitioning: How to choose a key

I find that, by default, Spark seems to write many small Parquet files. Would it be better to use partitioning to reduce this?

But how do I choose a partition key? For example, for a users dataset that I frequently query by ID, should I partition by id? But wouldn't that create one Parquet file per user?

What if I frequently query by two keys, but only by one or the other and never both at the same time? Is it useful to partition by both keys? For example, let's say I usually query by id or by country; should I use partitionBy('id', 'country')?

If there is no specific pattern in how I query the data, but I still want to limit the number of files, should I use repartition instead?

asked Apr 07 '18 by Jiew Meng

People also ask

How do I partition a Parquet file in spark?

Partitioning is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale. We can write a partitioned Parquet dataset using Spark's partitionBy() function: df.write.partitionBy("gender", "salary").parquet("/tmp/output/people2.parquet")
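For context, here is a minimal, self-contained sketch of that call, assuming a local SparkSession; the data, column names and output path are only placeholders. Each distinct combination of the partition columns becomes its own subdirectory on disk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitionby-example").getOrCreate()

    df = spark.createDataFrame(
        [("James", "M", 3000), ("Anna", "F", 4100), ("Robert", "M", 3000)],
        ["name", "gender", "salary"],
    )

    # One subdirectory per distinct (gender, salary) combination, e.g.
    #   /tmp/output/people2.parquet/gender=M/salary=3000/part-....parquet
    df.write.partitionBy("gender", "salary").parquet("/tmp/output/people2.parquet")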

How to get the number of partitions in a spark job?

df = spark.range(0, 20); print(df.rdd.getNumPartitions()). The above example yields 5 partitions. When you run Spark jobs on a Hadoop cluster, the default number of partitions depends on the following: on an HDFS cluster, by default, Spark creates one partition for each block of the file.
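As a small illustrative sketch (the partition counts you see depend on spark.default.parallelism and your cluster, so the numbers below are only examples), repartition() and coalesce() can be used to change the partition count, and therefore the number of files written out:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-count").getOrCreate()

    df = spark.range(0, 20)
    print(df.rdd.getNumPartitions())   # e.g. 5, depends on default parallelism

    # repartition()/coalesce() change the number of partitions, and therefore
    # the number of output files produced when the DataFrame is written
    print(df.repartition(2).rdd.getNumPartitions())  # 2
    print(df.coalesce(1).rdd.getNumPartitions())     # 1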

What happens if you don’t provide a partition key in spark?

If you don't provide a specific partition key (a column, in the case of a DataFrame), data will still be associated with a key. That produces a (K, V) pair, and the destination partition is determined by the partitioner: HashPartitioner is the default partitioner used by Spark.
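As a rough illustration of hash partitioning on a DataFrame (the column names and rows below are made up), repartition(n, col) assigns rows to partitions by hashing the given column, so all rows with the same value end up in the same in-memory partition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hash-partitioning").getOrCreate()

    df = spark.createDataFrame(
        [(1, "SG"), (2, "US"), (3, "SG"), (4, "FR")],
        ["id", "country"],
    )

    # Hash-partition by the country column into 4 partitions; rows sharing a
    # country value land in the same partition (some partitions may stay empty)
    repartitioned = df.repartition(4, "country")
    print(repartitioned.rdd.getNumPartitions())  # 4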

What is Spark SQL parquet?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
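A minimal round-trip sketch of that behaviour, assuming a local SparkSession (the path and columns are placeholders); the schema is read back from the Parquet metadata rather than being re-specified:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

    df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
    df.write.mode("overwrite").parquet("/tmp/output/users.parquet")

    # Column names and types are recovered from the file's own metadata
    restored = spark.read.parquet("/tmp/output/users.parquet")
    restored.printSchema()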


1 Answer

Partitions create a subdirectory for each value of the partition field, so if you are filtering by that field, instead of reading every file Spark will read only the files in the appropriate subdirectory.

  • You should partition when your data is too large and you usually work with a subset of the data at a time.

  • You should partition by a field that you need to filter by frequently and that has low cardinality, i.e. one that creates a relatively small number of directories with a relatively large amount of data in each directory.

You don't want to partition by a unique id, for example. It would create lots of directories with only one row per directory; this is very inefficient the moment you need to select more than one id.

Some typical partition fields could be dates if you are working with time series (daily dumps of data for instance), geographies (country, branches,...) or taxonomies (types of object, manufacturer, etc).
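To make this concrete, here is a hedged sketch of partitioning a users dataset by a low-cardinality column such as country rather than by the unique id (the column names, path and rows are invented for illustration). A filter on the partition column only touches the matching subdirectory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

    users = spark.createDataFrame(
        [(1, "SG", "2018-04-01"), (2, "US", "2018-04-01"), (3, "SG", "2018-04-02")],
        ["id", "country", "signup_date"],
    )

    # Partition by the low-cardinality column, not by the unique id
    users.write.mode("overwrite").partitionBy("country").parquet("/tmp/output/users")

    # Reading with a filter on the partition column only scans country=SG/
    sg_users = spark.read.parquet("/tmp/output/users").filter("country = 'SG'")
    sg_users.show()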

answered Sep 24 '22 by Manu Valdés