
Writing to Google Cloud Storage with v2 algorithm safe?

The Spark documentation section "Recommended settings for writing to object stores" says:

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.

Is it safe to use the v2 algorithm to write out to Google Cloud Storage?

What, exactly, does it mean for the algorithm to be "not safe"? What is the concrete set of criteria to use to decide whether I am in a situation where v2 is not safe?

asked Mar 01 '23 by Jacek Laskowski

1 Answer

Aah. I wrote that bit of the docs, and one of the papers you cite.

  1. GCP implements rename() non-atomically, so v1 isn't really any more robust than v2, and v2 can be a lot faster (see the configuration sketch after this list).
  2. The Azure "abfs" connector has O(1) atomic renames, so all is good there.
  3. S3 has suffered from both performance and safety problems. Now that it is consistent there is less risk, but it is still horribly slow on production datasets. Use a higher-performance committer (the EMR Spark committer, or the S3A committers).
  4. Or look at cloud-first table formats: Iceberg, Hudi, Delta Lake. This is where the focus is these days.
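
To make the v1/v2 choice concrete, here is a minimal sketch of selecting the committer algorithm version from Spark. The property is the standard Hadoop "mapreduce.fileoutputcommitter.algorithm.version" key passed through Spark's hadoop configuration prefix; the gs:// bucket path is a made-up example.

```scala
// A minimal sketch: choosing FileOutputCommitter v1 vs v2 from Spark.
// The property name is the standard Hadoop one; the bucket is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-algorithm-demo")
  // v1: task output is renamed into the job attempt dir, and job commit
  //     renames everything again; slower, but job commit is recoverable.
  // v2: task commit renames files straight into the destination; faster,
  //     but a failed job can leave partial output visible.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

spark.range(1000).write.mode("overwrite").parquet("gs://my-bucket/output")
```

On GCS, since directory rename is non-atomic either way, choosing v2 does not cost you a guarantee that v1 actually delivered there.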

Update October 2022

Apache Hadoop 3.3.5 added the Intermediate Manifest Committer (MAPREDUCE-7341) for correctness, performance and scalability on abfs and gcs (it also works on HDFS, FWIW). It commits a task by listing the output directory tree of the task attempt and saving the list of files to rename to a manifest file, which is committed atomically. Job commit is then a simple series of steps (a simplified sketch follows the list):

  1. List the manifest files to commit and load them as the list results are paged in.
  2. Create the output directory tree.
  3. Rename all source files to their destinations via a thread pool.
  4. Clean up the task attempt directories, which again can be done in a thread pool for gcs performance.
  5. Save the summary to the _SUCCESS JSON file and, if you want, to another directory; the summary includes statistics on all store IO done during task and job commit.
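
Here is a much-simplified Scala sketch of that job-commit sequence. Every name in it (Manifest, FileEntry, commitJob) is illustrative rather than the actual Hadoop API, and the filesystem calls are stubbed out.

```scala
// A much-simplified sketch of manifest-based job commit, NOT Hadoop's API.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ManifestCommitSketch {
  // Illustrative types: one manifest per committed task attempt.
  case class FileEntry(source: String, dest: String)
  case class Manifest(taskAttempt: String, dirs: Seq[String], files: Seq[FileEntry])

  // Placeholder filesystem calls; a real committer uses the Hadoop FileSystem API.
  def mkdirs(path: String): Unit = ()
  def rename(src: String, dst: String): Unit = ()

  // Job commit: the manifests parameter stands in for step 1 (list and load).
  def commitJob(manifests: Seq[Manifest]): Unit = {
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))
    manifests.flatMap(_.dirs).distinct.foreach(mkdirs)   // step 2: output dir tree
    val renames = manifests.flatMap(_.files)             // step 3: parallel renames
      .map(f => Future(rename(f.source, f.dest)))
    Await.result(Future.sequence(renames), Duration.Inf)
    // steps 4 and 5 (task attempt cleanup, _SUCCESS summary) omitted here
  }
}
```

The point of the design is that the only operation which must be atomic is the rename of each manifest file itself; everything else is listings and file-by-file renames, which parallelize well.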

This is correct for GCS because it relies on a single file rename as the sole atomic operation. For ABFS it adds support for rate limiting of IOPS and resilience to the way abfs fails when you try a few thousand renames in the same second. That is one of those problems which only surface in production, not in benchmarking.

This committer ships with Hadoop 3.3.5 and will not be backported; use Hadoop binaries of that or a later version if you want to use it.
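
For Spark users, the wiring looks roughly like the following sketch. The factory and protocol class names follow the Hadoop 3.3.5 manifest committer documentation and the spark-hadoop-cloud module; treat them as something to verify against the docs for your exact versions.

```scala
// A hedged sketch of binding the manifest committer for gs:// output in Spark.
// Class and property names follow the Hadoop 3.3.5 docs; verify for your version.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("manifest-committer-demo")
  // bind the gs filesystem scheme to the manifest committer factory
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs",
    "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory")
  // Spark-side bindings, from the spark-hadoop-cloud module
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```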

answered Mar 04 '23 by stevel