I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what it's telling me. The numbers in the "Shuffle Write Size / Records" column make sense; they are consistent with the data I'm processing.
What I don't understand are the numbers in the "Input Size / Records" column. They indicate that the incoming data has only ~67 records in each partition; the job has 200 partitions, so ~1200 records in all. I don't know what that refers to; none of the input datasets to this job (which was implemented using Spark SQL) have ~1200 records in them.
So, I'm flummoxed as to what those numbers are referring to. Can anyone enlighten me?
Your Input Size / Records is too low. It means that each task is only processing approximately 14 MB of data, which is too little. The rule of thumb is that it should be around 128 MB per task.
You can change this by raising the HDFS block size to 128 MB, i.e. setting dfs.blocksize to 134217728 in hdfs-site.xml, or, if you are reading from AWS S3, by setting fs.s3a.block.size to 134217728 in core-site.xml.
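If you can't (or would rather not) edit the cluster-wide XML files, the same values can be set per job through the SparkContext's Hadoop configuration. This is only a sketch, not the answer's exact steps: the app name and input path are placeholders, and note that for HDFS the block size only applies to files written after the change (existing files keep the block size they were written with), whereas for S3A the setting directly changes how reads are split.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("block-size-example"))

// Same properties as the XML files, but scoped to this one job.
// dfs.blocksize affects how new HDFS files are written; fs.s3a.block.size is
// the block size the S3A filesystem reports, so it changes how input splits
// (and hence "Input Size / Records" per task) are computed when reading.
sc.hadoopConfiguration.set("dfs.blocksize", "134217728")      // 128 MB
sc.hadoopConfiguration.set("fs.s3a.block.size", "134217728")  // 128 MB

val input = sc.textFile("s3a://my-bucket/path/to/input")      // placeholder path
println(s"Input partitions: ${input.partitions.length}")
```

The same properties can also be passed at submit time with the spark.hadoop.* prefix, e.g. --conf spark.hadoop.fs.s3a.block.size=134217728, if you prefer not to touch the code.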
Changing this will also bring down the number of partitions.
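To verify the effect, you can check how many partitions Spark actually creates for the input. A minimal check, assuming a Spark SQL job like the one described (the path and file format here are placeholders):

```scala
// Count the partitions of the input DataFrame; with a larger block/split size
// you should see fewer, larger partitions under "Input Size / Records".
val df = sqlContext.read.parquet("hdfs:///path/to/input")  // placeholder path
println(s"Input partitions: ${df.rdd.partitions.length}")
```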
Next, your Shuffle Write Size / Records is too high. This means a lot of data is being exchanged in the shuffle, which is an expensive operation. Look at your code to see whether you can optimize it or restructure your operations so that they don't shuffle as much data.
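What "shuffling less" looks like depends entirely on your query, and the original job was written with Spark SQL, so the following is only an illustrative RDD-level sketch of the general idea: aggregate before the shuffle rather than after it, so fewer records end up in "Shuffle Write Size / Records". The path and key extraction are made up.

```scala
// Placeholder pair RDD: (key, 1) per input line.
val pairs = sc.textFile("hdfs:///path/to/events")           // placeholder path
  .map(line => (line.split(",")(0), 1L))

// groupByKey ships every single record across the network, then sums:
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines within each partition first (map-side combine),
// so far fewer records are written to the shuffle:
val countsFast = pairs.reduceByKey(_ + _)
```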