 

Spark Parquet Partitioning: How to choose a key

I find that, by default, Spark seems to write many small Parquet files. Would it be better to use partitioning to reduce this?

But how do I choose a partition key? For example, for a users dataset that I frequently query by ID, should I partition by id? But wouldn't that create one Parquet file per user?

What if I frequently query by two keys, but only by one or the other and never both at the same time? Is it useful to partition by both keys? For example, let's say I usually query by id or by country; should I use partitionBy('id', 'country')?

If there is no specific pattern in how I query the data, but I still want to limit the number of files, should I use repartition instead?

asked Apr 07 '18 by Jiew Meng

People also ask

How do I partition a Parquet file in spark?

Partitioning is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale. We can write a partitioned Parquet dataset using Spark's partitionBy() function: df.write.partitionBy("gender", "salary").parquet("/tmp/output/people2.parquet")
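For context, here is a minimal, self-contained sketch of that call, assuming a local SparkSession; the data, column names and output path are only placeholders. Each distinct combination of the partition columns becomes its own subdirectory on disk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitionby-example").getOrCreate()

    df = spark.createDataFrame(
        [("James", "M", 3000), ("Anna", "F", 4100), ("Robert", "M", 3000)],
        ["name", "gender", "salary"],
    )

    # One subdirectory per distinct (gender, salary) combination, e.g.
    #   /tmp/output/people2.parquet/gender=M/salary=3000/part-....parquet
    df.write.partitionBy("gender", "salary").parquet("/tmp/output/people2.parquet")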

How to get the number of partitions in a spark job?

df = spark.range(0, 20); print(df.rdd.getNumPartitions()). The above example yields 5 partitions. When you run Spark jobs on a Hadoop cluster, the default number of partitions depends on the following: on an HDFS cluster, by default, Spark creates one partition for each block of the file.
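As a small illustrative sketch (the partition counts you see depend on spark.default.parallelism and your cluster, so the numbers below are only examples), repartition() and coalesce() can be used to change the partition count, and therefore the number of files written out:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-count").getOrCreate()

    df = spark.range(0, 20)
    print(df.rdd.getNumPartitions())   # e.g. 5, depends on default parallelism

    # repartition()/coalesce() change the number of partitions, and therefore
    # the number of output files produced when the DataFrame is written
    print(df.repartition(2).rdd.getNumPartitions())  # 2
    print(df.coalesce(1).rdd.getNumPartitions())     # 1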

What happens if you don’t provide a partition key in spark?

If you don't provide a specific partition key (a column, in the case of a DataFrame), data will still be associated with a key. That produces a (K, V) pair, and the destination partition is determined by the partitioner: HashPartitioner is the default partitioner used by Spark.
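As a rough illustration of hash partitioning on a DataFrame (the column names and rows below are made up), repartition(n, col) assigns rows to partitions by hashing the given column, so all rows with the same value end up in the same in-memory partition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hash-partitioning").getOrCreate()

    df = spark.createDataFrame(
        [(1, "SG"), (2, "US"), (3, "SG"), (4, "FR")],
        ["id", "country"],
    )

    # Hash-partition by the country column into 4 partitions; rows sharing a
    # country value land in the same partition (some partitions may stay empty)
    repartitioned = df.repartition(4, "country")
    print(repartitioned.rdd.getNumPartitions())  # 4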

What is Spark SQL parquet?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
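A minimal round-trip sketch of that behaviour, assuming a local SparkSession (the path and columns are placeholders); the schema is read back from the Parquet metadata rather than being re-specified:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

    df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
    df.write.mode("overwrite").parquet("/tmp/output/users.parquet")

    # Column names and types are recovered from the file's own metadata
    restored = spark.read.parquet("/tmp/output/users.parquet")
    restored.printSchema()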


1 Answer

Partitions create a subdirectory for each value of the partition field, so if you are filtering by that field, instead of reading every file Spark will read only the files in the appropriate subdirectory.

  • You should partition when your data is too large and you usually work with a subset of the data at a time.

  • You should partition by a field that you need to filter by frequently and that has low cardinality, i.e. one that creates a relatively small number of directories with a relatively large amount of data in each directory.

You don't want to partition by a unique id, for example. It would create lots of directories with only one row per directory; this is very inefficient the moment you need to select more than one id.

Some typical partition fields could be dates if you are working with time series (daily dumps of data for instance), geographies (country, branches,...) or taxonomies (types of object, manufacturer, etc).
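To make this concrete, here is a hedged sketch of partitioning a users dataset by a low-cardinality column such as country rather than by the unique id (the column names, path and rows are invented for illustration). A filter on the partition column only touches the matching subdirectory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

    users = spark.createDataFrame(
        [(1, "SG", "2018-04-01"), (2, "US", "2018-04-01"), (3, "SG", "2018-04-02")],
        ["id", "country", "signup_date"],
    )

    # Partition by the low-cardinality column, not by the unique id
    users.write.mode("overwrite").partitionBy("country").parquet("/tmp/output/users")

    # Reading with a filter on the partition column only scans country=SG/
    sg_users = spark.read.parquet("/tmp/output/users").filter("country = 'SG'")
    sg_users.show()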

answered Sep 24 '22 by Manu Valdés