Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'.

In the first step I am reading CSV's into Spark, do some processing to prepare the data for the second step, and write the data as parquet. The second step does a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID').
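
For context, a minimal PySpark sketch of the two steps might look like the following; the paths and the 'VALUE' data column used with the window are placeholders, not from the original question:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Step 1: read the CSVs, prepare the data, write it out as parquet
    df = spark.read.csv('/path/to/csv', header=True, inferSchema=True)
    df.write.parquet('/path/to/parquet')

    # Step 2: lots of group-wise work keyed on GROUP_ID and ordered by ID
    df2 = spark.read.parquet('/path/to/parquet')
    w = Window.partitionBy('GROUP_ID').orderBy('ID')
    result = (df2
              .withColumn('prev_value', F.lag('VALUE').over(w))  # 'VALUE' is a placeholder column
              .groupBy('GROUP_ID')
              .agg(F.count('ID').alias('n_rows')))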

The goal now is -- in order to avoid shuffling in the second step -- to efficiently load the data in the first step, as this is a one-timer.

Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is actually the basis for any "optimized write" considerations in the first place) - correct?

I came up with three possibilities:

  • df.orderBy('ID').write.partitionBy('GROUP_ID').parquet('/path/to/parquet')
  • df.orderBy('ID').repartition(n, 'GROUP_ID').write.parquet('/path/to/parquet')
  • df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')

I would set n such that the individual parquet files would be ~100MB.
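
Purely as an illustration of that sizing heuristic, picking n could look like the sketch below; the input size and compression ratio are assumed numbers, not something Spark reports:

    # Rough, hypothetical sizing: estimate the parquet output size and divide
    # by the ~100 MB per-file target to get a partition count.
    # df is the prepared DataFrame from step 1.
    input_bytes = 50 * 1024**3          # assumed: ~50 GB of raw CSV input
    parquet_ratio = 0.25                # assumed CSV -> parquet compression ratio
    target_file_bytes = 100 * 1024**2   # ~100 MB per parquet file

    n = max(1, int(input_bytes * parquet_ratio / target_file_bytes))

    df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID') \
        .write.parquet('/path/to/parquet')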

Question Part 2: Is it correct that the three options produce "the same"/similar results with regard to the goal (avoiding shuffling in the 2nd step)? If not, what is the difference? And which one is 'better'?

Question Part 3: Which of the three options performs best in step 1?

Thanks for sharing your knowledge!


EDIT 2017-07-24

After doing some tests (writing to and reading from parquet) it seems that Spark is not able to recover partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) seems to be determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 would be WRONG, and questions 2 and 3 would be irrelevant.
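
For reference, the kind of check behind this observation might look like the following; the path and the partition count of 10 are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # df is the prepared DataFrame from step 1

    # Write with an explicit partition count, then read back and check how many
    # partitions Spark actually assigns to the loaded DataFrame.
    df.repartition(10, 'GROUP_ID').sortWithinPartitions('ID') \
        .write.mode('overwrite').parquet('/tmp/partition_test')

    df_back = spark.read.parquet('/tmp/partition_test')
    print(df_back.rdd.getNumPartitions())   # observed: driven by cores /
                                            # spark.default.parallelism, not by
                                            # the 10 partitions written above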

So it turns out the REAL QUESTION is: is there a way to tell Spark that the data is already partitioned by column X and sorted by column Y?

asked Jul 20 '17 at 20:07 by akoeltringer


People also ask

Which method is used to reduce the number of partitions post processing in Spark?

Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(), where the movement of data across partitions is lower because coalesce() merges existing partitions instead of performing a full shuffle.
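
A short illustration of the difference (the partition counts below are arbitrary, and df stands for any existing DataFrame):

    # repartition() can increase or decrease the partition count but triggers a
    # full shuffle; coalesce() only decreases it by merging existing partitions.
    df_merged = df.coalesce(8)           # merge down to 8 partitions, no full shuffle
    df_shuffled = df.repartition(8)      # same target count, but with a full shuffle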

Why is Parquet better for Spark?

Parquet has higher execution speed than other standard file formats such as Avro and JSON, and it also consumes less disk space than Avro and JSON.


2 Answers

You will probably be interested in Spark's bucketing support.

See details here https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html

large.write
  .bucketBy(4, "id")              // hash rows into 4 buckets by the "id" column
  .sortBy("id")                   // sort rows by "id" within each bucket
  .mode(SaveMode.Overwrite)
  .saveAsTable(bucketedTableName)

Note that Spark 2.4 added support for bucket pruning (analogous to partition pruning).
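
For reference, a roughly equivalent PySpark sketch, adapted to the question's GROUP_ID/ID columns; the bucket count and table name are placeholders:

    # Bucketed, sorted table via the PySpark DataFrameWriter. Note that
    # bucketBy/sortBy require saveAsTable (a metastore table), not plain .parquet().
    # df is the prepared DataFrame from step 1.
    (df.write
       .bucketBy(4, 'GROUP_ID')
       .sortBy('ID')
       .mode('overwrite')
       .format('parquet')
       .saveAsTable('bucketed_table'))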

The more direct functionality you're looking for is Hive's bucketed-sorted tables: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables. This is not yet available in Spark (see the PS section below).

Also note that the sorting information will not be loaded by Spark automatically, but since the data is already sorted, a sort operation on it will actually be much faster, as there is not much work to do - e.g. a single pass over the data just to confirm that it is already sorted.
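
One way to see whether the bucketing metadata is actually picked up is to read the table back and inspect the physical plan; this sketch assumes the hypothetical bucketed_table from above and an existing SparkSession named spark:

    # With bucketing in effect, the plan for an aggregation keyed on the bucket
    # column should not contain an Exchange (shuffle) before the aggregation.
    bucketed = spark.table('bucketed_table')
    bucketed.groupBy('GROUP_ID').count().explain()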

PS. Spark and Hive bucketing are slightly different. This is the umbrella ticket to provide compatibility in Spark for bucketed tables created in Hive - https://issues.apache.org/jira/browse/SPARK-19256

answered Nov 03 '22 at 00:11 by Tagar


As far as I know, NO, there is no way to read data from parquet and tell Spark that it is already partitioned by some expression and ordered.

In short, one file on HDFS etc. is too big for one Spark partition. And even if you read the whole file into one partition by playing with Parquet properties such as parquet.split.files=false, parquet.task.side.metadata=true, etc., the cost would most likely be higher than just doing one shuffle.

answered Nov 03 '22 at 01:11 by Mikhail Dubkov