
Remove Empty Partitions from Spark RDD

I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.

Is there any other way to go about this?

user3898179 asked Oct 22 '15 09:10




1 Answer

There isn't an easy way to simply delete the empty partitions from an RDD.

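coalesce doesn't guarantee that the empty partitions will be removed. If you have an RDD with 40 empty partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45): 45 partitions remain, and without a shuffle coalesce only merges existing partitions, so at most 10 of them can hold data.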
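To see how many partitions are actually empty before and after a coalesce, the record count of each partition can be collected. The following is a minimal PySpark-style sketch, not part of the original answer: the helper names count_records, partition_sizes and empty_partition_count are hypothetical, and rdd is assumed to be an existing RDD.

```python
def count_records(partition):
    """Count the records in one partition.

    mapPartitions passes each partition as an iterator, so the records
    are counted by consuming it. The result is wrapped in a list because
    mapPartitions expects an iterable back.
    """
    return [sum(1 for _ in partition)]


def partition_sizes(rdd):
    """Return the record count of every partition, in partition order.

    Assumes `rdd` is a PySpark RDD; this triggers a Spark job.
    """
    return rdd.mapPartitions(count_records).collect()


def empty_partition_count(sizes):
    """Number of partitions that hold no records."""
    return sum(1 for size in sizes if size == 0)


# Usage, assuming `rdd` already exists:
#   before = partition_sizes(rdd)
#   after = partition_sizes(rdd.coalesce(45))
#   print(empty_partition_count(before), "empty before coalesce")
#   print(empty_partition_count(after), "empty after coalesce")
```

Comparing the two counts shows whether coalesce actually got rid of the empty partitions for a given dataset, rather than assuming it did.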

The repartition method performs a full shuffle and spreads the data roughly evenly over the requested number of partitions, so empty partitions are unlikely as long as there are at least as many records as partitions. If you have an RDD with 50 empty partitions and 10 partitions with data and run rdd.repartition(20), the data will be split roughly evenly across the 20 partitions.
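One way to pick the target for repartition is to count the partitions that currently hold data and use that number. This is only a sketch under assumptions, not a built-in Spark feature: has_records and repartition_to_non_empty are hypothetical helper names, and rdd is assumed to be a PySpark RDD.

```python
def has_records(partition):
    """Return [1] if the partition iterator yields at least one record, else [0].

    Only the first record is read, so large partitions are not scanned.
    """
    for _ in partition:
        return [1]
    return [0]


def repartition_to_non_empty(rdd):
    """Repartition `rdd` to as many partitions as currently hold data.

    Assumes `rdd` is a PySpark RDD. Counting the non-empty partitions
    runs one Spark job, and repartition then performs a full shuffle.
    """
    non_empty = sum(rdd.mapPartitions(has_records).collect())
    # Request at least one partition, even when the RDD has no data.
    return rdd.repartition(max(1, non_empty))


# Usage, assuming `rdd` already exists:
#   compacted = repartition_to_non_empty(rdd)
#   print(compacted.getNumPartitions())
```

The trade-off is an extra pass to count partitions plus the cost of the shuffle, which is only worthwhile when processing the empty partitions is more expensive than moving the data.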

Powers answered Sep 29 '22 19:09