I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the partitions based on the number of HDFS blocks. This leads to a large number of empty partitions, which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.
Is there any other way to go about this?
Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(), since coalesce moves less data across the partitions.
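For instance, a quick sketch (assuming a spark-shell session where sc is already defined):

    // Shrink 50 partitions down to 10. coalesce uses narrow dependencies,
    // so it avoids the full shuffle that repartition would trigger.
    val rdd = sc.parallelize(1 to 1000, 50)
    val fewer = rdd.coalesce(10)
    println(fewer.getNumPartitions)   // 10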
As already mentioned above, one partition is created for each HDFS block of the file, which by default is 64 MB in size. However, when creating an RDD, a second argument can be passed that sets the (minimum) number of partitions for the RDD. A call like the one sketched below will create an RDD named textFile with 5 partitions.
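A minimal sketch of such a call (the HDFS path and the count of 5 are placeholders):

    // The second argument asks Spark for at least 5 partitions instead of
    // one per HDFS block. The path is hypothetical.
    val textFile = sc.textFile("hdfs://namenode:9000/data/input.txt", 5)
    println(textFile.getNumPartitions)   // 5 (can be higher for very large files)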
Coalesce is a method for repartitioning the data in a DataFrame; it is mainly used to reduce the number of partitions. You can refer to this link and link for more details on coalesce and repartition. And yes, if you use df.coalesce(1) it will write only one file (in your case one parquet file).
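As a rough sketch of that last point (assuming a SparkSession named spark and a made-up output path):

    // Collapse the DataFrame to a single partition so the write produces
    // exactly one parquet part file. The path below is hypothetical.
    val df = spark.range(1000).toDF("id")
    df.coalesce(1).write.parquet("hdfs://namenode:9000/output/single_file")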
There isn't an easy way to simply delete the empty partitions from an RDD.

coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).

The repartition method splits the data evenly over all the partitions, so there won't be any empty partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly split across the 20 partitions.
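Here's a small sketch of that behaviour, with made-up numbers (assuming sc from a spark-shell session); the mapPartitions counts are only there to show which partitions hold data:

    // 100,000 rows in 60 partitions; the filter leaves data in only a few of them.
    val sparse = sc.parallelize(1 to 100000, 60).filter(_ <= 10000)

    // Count how many partitions actually hold data.
    val nonEmpty = sparse.mapPartitions(it => Iterator(it.nonEmpty)).filter(identity).count()
    println(s"non-empty partitions before repartition: $nonEmpty of ${sparse.getNumPartitions}")

    // repartition shuffles the surviving rows roughly evenly across 20 new
    // partitions, so (with this much data) none of them end up empty.
    val compacted = sparse.repartition(20)
    val stillEmpty = compacted.mapPartitions(it => Iterator(it.isEmpty)).filter(identity).count()
    println(s"empty partitions after repartition: $stillEmpty")   // 0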