I have thousands of compressed files, each about 2 GB in size, sitting in HDFS. I am using Spark to process these files, loading them from HDFS with the textFile() method. My question is: how can I repartition the data so that each file is processed in parallel? Currently, each .gz file is processed in a single task, so if I process 1000 files, only 1000 tasks are executed. I understand that compressed gzip files are not splittable, but is there another approach I could use to make my job run faster?
You can call rdd.repartition(numPartitions) after loading the files. This incurs a shuffle, so you need to evaluate whether the gain in parallelism outweighs the initial shuffle cost.
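A minimal PySpark sketch of this approach, assuming a SparkContext `sc`; the HDFS path and partition count are placeholders you would adjust for your cluster:

    from pyspark import SparkContext

    sc = SparkContext(appName="gz-repartition-example")

    # Each .gz file is non-splittable, so it arrives as a single partition.
    rdd = sc.textFile("hdfs:///data/compressed/*.gz")

    # Shuffle the data into more partitions so downstream stages run in parallel.
    # 4000 is an illustrative number, not a recommendation.
    rdd = rdd.repartition(4000)

    # Subsequent transformations now run with 4000 tasks instead of ~1000.
    total_chars = rdd.map(lambda line: len(line)).sum()
    print(total_chars)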
Another option is to perform your transformations (map, filter, ...) on the initial partitions and rely on a shuffle stage already present in your pipeline to repartition the RDD, e.g.:
rdd.map(...).filter(...).flatMap(...).sortBy(f, numPartitions=newNumPartitions)
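A hedged sketch of this second approach, reusing the `rdd` loaded above; the parsing logic and `num_partitions` value are illustrative assumptions. The narrow operations (map, filter, flatMap) still run with one task per file, but the sortBy shuffle redistributes the data into `num_partitions` partitions for everything downstream:

    num_partitions = 4000

    result = (rdd
              .map(lambda line: line.split(","))          # still one task per .gz file
              .filter(lambda fields: len(fields) > 1)
              .flatMap(lambda fields: fields)
              .sortBy(lambda x: x, numPartitions=num_partitions))  # shuffle repartitions here

This avoids paying for an extra repartition() shuffle when your pipeline already contains a wide transformation.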