df.write
  .option("header", "false")
  .option("quote", null)
  .option("delimiter", Delimiter)
  .csv(tempPath)
When I save 2 KB files, it takes less than 5 seconds to save to S3, but when I try to save large files of about 20 GB, it takes more than 1 hour.
Any suggestions to speed up the writing process?
I am using "s3a://" for saving.
UPDATE: When I process 5 KB of input data and generate a 20 KB file to store to S3, it takes 8 seconds. When I process 250 MB of input data and generate a 20 KB file to store to S3, it takes 45 minutes. I do a count before saving, so the data is already evaluated by Spark before the save. And it takes less than a second when I copy the same 20 KB file to S3 using the "aws s3 cp" command.
So what is Spark doing that slows down the save so much?
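For context, the count-before-save pattern described in the update looks roughly like this (a minimal sketch; the DataFrame name and output path are placeholders, not from the question):

// Materialize the transformed data first, so the timing below covers only the write.
val result = transformedDf.cache()
result.count()   // forces evaluation before the save

result.write
  .option("header", "false")
  .option("delimiter", Delimiter)
  .csv("s3a://my-bucket/output/")   // placeholder bucket and path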
It's not the write itself; it's the fact that the output is committed by rename, which s3a emulates with a list, a copy, and a delete. The more files and the more data you have, the longer it takes. The "use algorithm 2" technique makes things slightly faster, but it is still not safe to use.
Unless whoever supplies your S3 client says otherwise, work against HDFS and copy to S3 afterwards. (The EMR S3 connector is safe to write to directly.)
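A minimal sketch of both options mentioned above; the paths, bucket name, and Delimiter value are placeholders:

// Option A: the "algorithm 2" tweak. Slightly faster, but still rename-based
// and, per the answer, not safe to rely on against S3.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Option B (recommended): write to HDFS first, then copy the finished output to S3.
df.write
  .option("header", "false")
  .option("delimiter", Delimiter)
  .csv("hdfs:///tmp/job_output")

// Afterwards, copy outside of the Spark commit path, for example:
//   hadoop distcp hdfs:///tmp/job_output s3a://my-bucket/output/
// (on EMR, s3-dist-cp serves the same purpose)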