df.write
  .option("header", "false")
  .option("quote", null)
  .option("delimiter", Delimiter)
  .csv(tempPath)
When I save 2 KB files, it takes less than 5 seconds to save to S3, but when I try to save large files of about 20 GB, it takes more than 1 hour.
Any suggestions to speed up the writing process?
I am using "s3a://" for saving.
UPDATE: When I process 5 KB of input data and generate a 20 KB file to store to S3, it takes 8 seconds. When I process 250 MB of input data and generate a 20 KB file to store to S3, it takes 45 minutes. I do a count before saving, so the data is already evaluated by Spark before the save. And it takes less than a second when I copy the same 20 KB file to S3 using the "aws s3 cp" command.
So what is Spark doing that slows down the save so much?
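For context, the count-before-save pattern described in the update looks roughly like this (a minimal sketch; the DataFrame name and output path are placeholders, not from the question):

// Materialize the transformed data first, so the timing below covers only the write.
val result = transformedDf.cache()
result.count()   // forces evaluation before the save

result.write
  .option("header", "false")
  .option("delimiter", Delimiter)
  .csv("s3a://my-bucket/output/")   // placeholder bucket and path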
It's not the write itself; it's the fact that the output is committed by rename, which s3a emulates with a list, a copy, and a delete. The more files and the more data you have, the longer it takes. The "use algorithm 2" technique makes things slightly faster, but it is still not safe to use.
Unless whoever supplies your S3 client says otherwise, work against HDFS and copy to S3 afterwards. (The EMR S3 connector is safe to write to directly.)
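A minimal sketch of both options mentioned above; the paths, bucket name, and Delimiter value are placeholders:

// Option A: the "algorithm 2" tweak. Slightly faster, but still rename-based
// and, per the answer, not safe to rely on against S3.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Option B (recommended): write to HDFS first, then copy the finished output to S3.
df.write
  .option("header", "false")
  .option("delimiter", Delimiter)
  .csv("hdfs:///tmp/job_output")

// Afterwards, copy outside of the Spark commit path, for example:
//   hadoop distcp hdfs:///tmp/job_output s3a://my-bucket/output/
// (on EMR, s3-dist-cp serves the same purpose)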