 

How to repartition a compressed file in Apache Spark?

I have thousands of compressed files, each about 2 GB in size, sitting in HDFS. I am using Spark to process these files, loading them with the textFile() method. My question is: how can I repartition the data so that I can process each file in parallel? Currently each .gz file is processed in a single task, so if I process 1000 files, only 1000 tasks are executed. I understand that compressed files are not splittable, but is there another approach I could use to make my job run faster?
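
The loading code looks roughly like this (the path and app name here are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="process-gz")
# Each .gz file is read as a single partition because gzip is not splittable,
# so 1000 input files yield only 1000 tasks.
rdd = sc.textFile("hdfs:///data/input/*.gz")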

asked Oct 19 '22

1 Answer

You can use rdd.repartition(numPartitions) after loading the file. This has an associated shuffle cost, so you need to evaluate whether the performance gain from the extra parallelism outweighs this initial shuffle cost.
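
A minimal sketch, continuing from the rdd loaded in the question (the target of 8000 partitions is illustrative; choose it based on your cluster's cores):

# Shuffle the gzip-derived partitions into many smaller ones so that
# downstream stages can run across the whole cluster.
repartitioned = rdd.repartition(8000)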

Another way would be to perform any narrow transformations (map, filter, ...) on the initial partitions and let a shuffle stage already present in your pipeline repartition the RDD, e.g.

rdd.map(...).filter(...).flatMap(...).sortBy(f, numPartitions=newNumPartitions)
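
A sketch of what that could look like in PySpark, assuming hypothetical line-parsing logic and the same illustrative partition count as above:

# Narrow transformations still run on the original one-partition-per-file
# layout; the shuffle triggered by sortBy then redistributes the data
# into 8000 partitions for the rest of the pipeline.
result = (rdd.map(lambda line: line.split(","))
             .filter(lambda fields: len(fields) > 1)
             .flatMap(lambda fields: fields)
             .sortBy(lambda field: field, numPartitions=8000))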
answered Oct 23 '22 by maasg