Recommended settings for writing to object stores says:
For object stores whose consistency model means that rename-based commits are safe use the
FileOutputCommitter
v2 algorithm for performance; v1 for safety.
Is it safe to use the v2 algorithm to write out to Google Cloud Storage?
What, exactly, does it mean for the algorithm to be "not safe"? What are the concrete set of criteria to use to decide if I am in a situation where v2 is not safe?
aah. I wrote that bit of the docs. And one of the papers you cite.
Apache Hadoop 3.3.5 added in MAPREDUCE-7341 the Intermediate Manifest Committer for correctness, performance and scalability on abfs and gcs. (it also works on hdfs, FWIW). it commits tasks by listing the output directory trees of task attempts and saves the list of files to rename to a manifest file, which is committed atomically. Job commit is a simple series of
This is correct for GCS as it relies on a single file rename as the sole atomic operation. For ABFS it adds support for rate limiting of IOPS and resilience to the way abfs fails when you try a few thousand renames in the same second. One of those examples of a problem which only surfaces in production, not in benchmarking.
This committer ships with Hadoop 3.3.5 and will not be backported -use hadoop binaries of this or a later version if you want to use it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With