 

Spark not ignoring empty partitions

I am trying to read a subset of a dataset by using a pushdown predicate. My input dataset consists of 1.2 TB spread across 43,436 Parquet files stored on S3. With the pushdown predicate I am supposed to read only 1/4 of the data.
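For reference, here is a minimal sketch of the kind of read I am doing (the bucket path and the filter column event_date are illustrative, not the real ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-read").getOrCreate()
    import spark.implicits._

    // Read the Parquet dataset from S3 and apply a filter that Spark can
    // push down to the Parquet readers (path and column are illustrative).
    val df = spark.read
      .parquet("s3://my-bucket/my-dataset/")
      .filter($"event_date" >= "2020-01-01")

    df.count()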

Looking at the Spark UI, I see that the job actually reads 1/4 of the data (300 GB), but there are still 43,436 partitions in the first stage of the job. Only 1/4 of these partitions contain data; the other 3/4 are empty (check the median input size in the attached screenshots).

I was expecting Spark to create partitions only for the non-empty splits. I am seeing a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to another job that reads the pre-filtered dataset (1/4 of the data) directly. I suspect that this overhead is due to the huge number of empty partitions/tasks in my first stage, so I have two questions:

  1. Is there any workaround to avoid these empty partitions?
  2. Can you think of any other reason for the overhead? Maybe executing the pushdown filter is naturally a little bit slow?

Thank you in advance

[Screenshot: Spark UI showing the data read]
[Screenshot: Spark UI stage metrics]

asked Jun 25 '20 by Wassim Maaoui


People also ask

How do I remove a blank partition in Spark?

There isn't an easy way to simply delete the empty partitions from an RDD. coalesce doesn't guarantee that the empty partitions will be removed: if you have an RDD with 40 blank partitions and 10 partitions with data, there can still be empty partitions after rdd.coalesce(45).
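For example, a quick sketch like the following (illustrative values, assuming a spark-shell session where spark is already defined) shows that empty partitions can survive a coalesce:

    // Build an RDD with 10 data partitions plus 40 empty ones, then coalesce.
    val rdd = spark.sparkContext.parallelize(1 to 10, 10)
      .union(spark.sparkContext.parallelize(Seq.empty[Int], 40))

    val coalesced = rdd.coalesce(45)

    // Count how many of the resulting partitions actually hold records.
    val nonEmpty = coalesced
      .mapPartitions(it => Iterator.single(if (it.hasNext) 1 else 0))
      .sum()

    println(s"${coalesced.getNumPartitions} partitions, $nonEmpty non-empty")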

How do I reduce the number of partitions in Spark?

Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(), where the movement of data across partitions is lower.
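For example (a small sketch, assuming a spark-shell session where sc is already defined):

    // coalesce() merges existing partitions without a full shuffle,
    // while repartition() shuffles the data and can also increase the count.
    val rdd = sc.parallelize(1 to 1000, 100)

    val fewer  = rdd.coalesce(10)       // narrow dependency, no shuffle
    val redist = rdd.repartition(200)   // full shuffle

    println(fewer.getNumPartitions)     // 10
    println(redist.getNumPartitions)    // 200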

Can we decide no of partitions created in Spark?

The number of partitions in Spark should be decided thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions will make each partition have less data, or no data at all.

What is the use of empty RDD in Spark?

Using Spark's sc.parallelize() we can create an empty RDD that still has partitions; writing such a partitioned RDD to a file results in the creation of multiple part files.
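For instance (a sketch, assuming a spark-shell session; the output path is illustrative):

    // An empty RDD that still has 5 partitions; saving it produces 5 empty part files.
    val emptyRdd = sc.parallelize(Seq.empty[String], 5)
    println(emptyRdd.getNumPartitions)               // 5
    emptyRdd.saveAsTextFile("/tmp/empty-rdd-demo")   // part-00000 .. part-00004, all empty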


3 Answers

Using S3 Select, you can retrieve only a subset of data.

With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
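A hedged sketch of what that can look like, using the data source name from the Amazon EMR documentation (s3selectCSV, which targets CSV/JSON objects); the path and filter column here are illustrative:

    import org.apache.spark.sql.functions.col

    // Ask S3 Select to return only the matching rows instead of whole objects.
    val selected = spark.read
      .format("s3selectCSV")                // EMR-specific data source
      .load("s3://my-bucket/csv-data/")
      .where(col("category") === "books")   // filter handed to S3 Select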

Otherwise, S3 acts as an object store, in which case an entire object has to be read. In your case you would have to read all content from all files and filter it on the client side.

There is actually a very similar question, where by testing you can see that:

The input size was always the same as the Spark job that processed all of the data

You can also see this question about optimizing data read from s3 of parquet files.

answered Oct 18 '22 by Yosi Dahari


It seems like your files are rather small: 1.2 TB / 43,436 ≈ 30 MB each. So you may want to look at increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in the option's description:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
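A sketch of what trying that could look like (the 512 MB value and the path are just illustrative):

    // Raise the per-partition packing target so Spark groups more of the
    // ~30 MB files into each partition (default is 128 MB).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)

    val df = spark.read.parquet("s3://my-bucket/my-dataset/")
    println(df.rdd.getNumPartitions)   // should drop compared to the default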

answered Oct 18 '22 by mazaneicha


Empty partitions: It seems that Spark (2.4.5) tries to build partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128 MB) by packing many files into one partition (source code here). However, it does this before running the job, so it cannot know that 3/4 of the files will produce no output once the pushed-down predicate is applied. For the partitions that were packed only with files whose rows are all filtered out, I end up with empty partitions. This also explains why my max partition size is 44 MB and not 128 MB: by chance, no partition contained only files that pass the pushdown filter.
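For context, this is roughly how the packing size is derived before the scan runs (a simplified paraphrase of the Spark 2.4 file-partition sizing logic; the parallelism value below is illustrative):

    val defaultMaxSplitBytes = 128L * 1024 * 1024           // spark.sql.files.maxPartitionBytes
    val openCostInBytes      = 4L * 1024 * 1024              // spark.sql.files.openCostInBytes
    val defaultParallelism   = 400L                          // depends on the cluster
    val filesCount           = 43436L
    val dataBytes            = 1200L * 1024 * 1024 * 1024    // ~1.2 TB of Parquet

    val totalBytes    = dataBytes + filesCount * openCostInBytes
    val bytesPerCore  = totalBytes / defaultParallelism
    val maxSplitBytes = math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
    // Files are then packed greedily into partitions of up to maxSplitBytes,
    // before the pushed-down predicate has filtered anything out.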

20% Overhead: Finally, this is not due to the empty partitions. I managed to get far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1 GB, but it did not improve the read. I think the overhead is due to opening many files and reading their metadata. Spark estimates that opening a file is equivalent to reading 4 MB (spark.sql.files.openCostInBytes), so opening many files, even ones that will barely be read thanks to the filter, is not negligible.
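As a back-of-the-envelope sketch of that estimated cost (assuming the default 4 MB open cost):

    // "Virtual" bytes Spark's planner assigns just to opening the 43,436 files.
    val openCostInBytes = 4L * 1024 * 1024
    val filesCount      = 43436L
    val virtualOpenCost = filesCount * openCostInBytes
    println(f"${virtualOpenCost / math.pow(1024, 3)}%.0f GB equivalent")   // ≈ 170 GB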

answered Oct 18 '22 by Wassim Maaoui