 

Does S3 Strong Consistency mean it is safe to use S3 as a checkpointing location for Spark Structured Streaming applications?

In the past, the general consensus was that you should not use S3 as a checkpointing location for Spark Structured Streaming applications.

However, now that S3 offers strong read after write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?

In my experiments, I continue to see checkpointing-related exceptions in my Spark Structured Streaming applications, but I am uncertain where the problem actually lies.

Lukasz Krawiec asked Nov 16 '25 13:11



1 Answer

Not really. You get consistency of listings and updates, but rename is still emulated with copy and delete, and I think the standard checkpoint algorithm depends on it.
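To make the rename point concrete, here is a minimal sketch in Python. It is a toy in-memory model, not S3 or the S3A connector: the `ToyObjectStore` class, the `fail_after_copy` flag, and the key names are all hypothetical and exist only for illustration. It shows how a rename built from copy plus delete can stop halfway and leave both the temporary file and the final file behind.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not the S3 API)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst, fail_after_copy=False):
        # No native rename exists here, so emulate it as copy, then delete.
        self.objects[dst] = self.objects[src]  # step 1: copy
        if fail_after_copy:
            # Simulate the process dying between the two steps.
            raise RuntimeError("simulated crash between copy and delete")
        del self.objects[src]  # step 2: delete


store = ToyObjectStore()
# Write to a temporary key first, then "rename" it into place.
store.put("checkpoint/offsets/.7.tmp", b"batch 7 offsets")
try:
    store.rename(
        "checkpoint/offsets/.7.tmp",
        "checkpoint/offsets/7",
        fail_after_copy=True,
    )
except RuntimeError:
    pass

# Both keys are still present: the rename was not all-or-nothing.
leftover = sorted(store.objects)
print(leftover)
```

On a filesystem with a real rename, the write-to-temp-then-rename pattern gives you a single all-or-nothing step. In this toy model it is two steps, and the copy also costs time proportional to the data size.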

Hadoop 3.3.1 added a new API, Abortable, to aid with a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination, but abort the write when aborting the checkpoint. A normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
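The write-then-close-or-abort idea can be sketched as follows. This is a toy Python model of the semantics only, not the Hadoop Abortable API itself: the `ToyAbortableStream` class, the dict used as a store, and the key names are hypothetical. The assumption being illustrated is that data is buffered until the stream finishes, so the object becomes visible only on close() and never appears after abort().

```python
class ToyAbortableStream:
    """Toy output stream whose object appears only on close() (hypothetical)."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._buffer = bytearray()
        self._finished = False

    def write(self, data):
        if self._finished:
            raise ValueError("stream already closed or aborted")
        self._buffer.extend(data)

    def close(self):
        # A normal close finishes the write and manifests the object.
        if not self._finished:
            self._store[self._key] = bytes(self._buffer)
            self._finished = True

    def abort(self):
        # Aborting discards buffered data; nothing becomes visible.
        self._buffer.clear()
        self._finished = True


store = {}

# Committed checkpoint: written straight to the destination key, no rename.
committed = ToyAbortableStream(store, "checkpoint/commits/7")
committed.write(b"ok")
committed.close()

# Cancelled checkpoint: the destination key never shows up.
cancelled = ToyAbortableStream(store, "checkpoint/commits/8")
cancelled.write(b"partial")
cancelled.abort()

visible_keys = sorted(store)
print(visible_keys)
```

The point of the design is that no temporary file and no rename are needed: the destination either appears complete or does not appear at all.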

AFAIK nobody has done the actual committer. There is an opportunity for you to contribute there...

stevel answered Nov 19 '25 10:11



