Does S3 Strong Consistency mean it is safe to use S3 as a checkpointing location for Spark Structured Streaming applications?

Question

In the past, the general consensus was that you should not use S3 as a checkpointing location for Spark Structured Streaming applications.

However, now that S3 offers strong read-after-write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?

In my experiments, I continue to see checkpointing-related exceptions in my Spark Structured Streaming applications, but I am uncertain where the problem actually lies.

stevel · Accepted Answer

Not really. You get consistency of listings and updates, but rename is still emulated with a copy followed by a delete, and I think the standard checkpoint algorithm depends on rename.
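To illustrate why an emulated rename is a problem, here is a toy sketch in Python. It is an in-memory model, not S3 or S3A code; the class `ToyObjectStore`, its methods, and the checkpoint paths are all hypothetical names made up for the example. It only shows that a rename built from copy and delete has an intermediate state where both keys exist.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with no native rename (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def list(self, prefix):
        # Listing is strongly consistent here: it always reflects self.objects.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_steps(self, src, dst):
        """Emulated rename as a generator, pausing between copy and delete."""
        self.objects[dst] = self.objects[src]  # step 1: copy
        yield "copied"
        del self.objects[src]                  # step 2: delete
        yield "deleted"


store = ToyObjectStore()
# Hypothetical temp-file name; a rename-based checkpointer writes here first.
store.put("checkpoint/offsets/.7.tmp", b"batch 7 offsets")

steps = store.rename_steps("checkpoint/offsets/.7.tmp", "checkpoint/offsets/7")
next(steps)                                   # copy done, delete not yet run
mid_rename = store.list("checkpoint/offsets/")
next(steps)                                   # delete done
after_rename = store.list("checkpoint/offsets/")

print(mid_rename)    # both the temp key and the final key are visible
print(after_rename)  # only the final key remains
```

Even with perfectly consistent listings, a reader or a crash between the two steps observes a state that an atomic rename would never expose, and the copy takes time proportional to the data size.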

Hadoop 3.3.1 added a new API, Abortable, to aid with a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination, but abort the write when aborting the checkpoint. A normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
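A minimal sketch of that write-then-close-or-abort idea, again as a toy Python model rather than the Hadoop API: the real `Abortable` interface is Java, and nothing below calls it. The class `ToyAbortableUpload`, the dictionary standing in for the bucket, and the paths are assumptions for illustration only.

```python
class ToyAbortableUpload:
    """Toy output stream: data becomes visible only on close(), never after abort()."""

    def __init__(self, store, key):
        self.store = store          # dict standing in for the visible bucket contents
        self.key = key              # final destination, no temp file and no rename
        self.buffer = bytearray()
        self.finished = False

    def write(self, data):
        if self.finished:
            raise ValueError("stream already closed or aborted")
        self.buffer.extend(data)

    def close(self):
        # Normal close: finish the write and make the object visible.
        if not self.finished:
            self.store[self.key] = bytes(self.buffer)
            self.finished = True

    def abort(self):
        # Abort: discard pending data so nothing appears at the destination.
        self.buffer.clear()
        self.finished = True


visible = {}

committed = ToyAbortableUpload(visible, "checkpoint/commits/7")
committed.write(b"v1")
committed.close()

cancelled = ToyAbortableUpload(visible, "checkpoint/commits/8")
cancelled.write(b"partial")
cancelled.abort()
cancelled.close()   # a close() after abort() is a no-op in this sketch

print(sorted(visible))
```

In this model the destination key either appears complete or not at all, which is the property the rename-based approach was relying on rename to provide.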

AFAIK nobody has done the actual committer. There's an opportunity for you to contribute there...

Tags:

amazon-s3

apache-spark

spark-structured-streaming

Lukasz Krawiec
