I'm on Spark 2.2.0, running on EMR. I have a big DataFrame `df` (40 GB or so in snappy-compressed files) which is partitioned by keys `k1` and `k2`. When I query by `k1 === v1` or by `(k1 === v1 && k2 === v2)`, I can see that Spark only reads the files in the matching partitions (about 2% of the files). However, if I cache or persist `df`, those same queries suddenly hit all the partitions, which either blows up memory or is much less performant. This was a big surprise. Is there any way to cache while preserving the partitioning information?
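For reference, the behaviour described above can be observed with `explain()`. A minimal sketch, where the path is a placeholder and `k1`/`k2` are the partition columns from the question:

```scala
// Hypothetical setup: a DataFrame read from files partitioned by k1 and k2.
val df = spark.read.parquet("s3://bucket/table")

// Without caching, the physical plan shows a file scan with PartitionFilters,
// so only the matching directories are read:
df.filter($"k1" === "v1" && $"k2" === "v2").explain()

// After caching, the same filter is planned against the in-memory relation
// (InMemoryTableScan), which no longer carries the on-disk partition layout:
df.cache()
df.filter($"k1" === "v1").explain()
```

Comparing the two plans makes the loss of partition pruning visible without timing anything.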
Background: `cache()` is a lazy Spark operation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action on it. It stores the data in the memory of the cluster's workers, so repeated computations are not re-executed; this saves execution time and lets you run more jobs on the same cluster. The difference between `cache()` and `persist()` is that `cache()` uses a default storage level, while `persist()` lets you choose among several storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).
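The two calls look like this in practice; a minimal sketch, assuming `df` is the DataFrame from the question:

```scala
import org.apache.spark.storage.StorageLevel

df.cache()                          // default storage level
// or, equivalently but with an explicit storage level:
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                          // an action materializes the cache

df.unpersist()                      // release the cached data when done
```

Note that both calls are lazy: nothing is cached until the first action runs over the data.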
This is to be expected. The internal columnar format Spark uses for caching is input-format agnostic: once the data is loaded into the cache, the connection to the original input, including its partition layout, is gone.

The exception here is the new data source v2 API ([SPARK-22389][SQL] data source v2 partitioning reporting interface), which allows a source to report partitioning information, but it is new in 2.3 and still experimental.
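Until that API is usable, one workaround is to cache only the slice you query repeatedly rather than the whole DataFrame, so the cache never holds the other 98% of the data. A sketch under that assumption:

```scala
// Assumes only the k1 = v1 slice is reused: filtering before caching keeps
// partition pruning for the initial read, and the cache then holds only
// the needed partitions.
val slice = df.filter($"k1" === "v1").cache()

slice.count()   // materializes the cache from ~2% of the files
// subsequent actions on `slice` hit only the cached data
```

This trades generality for memory: queries outside the cached slice fall back to the (still pruned) on-disk read of `df`.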