
TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)?

I cannot find any documentation on this, so I wonder: what is the behavior of TextIO.Write if the output files already exist (in a gs:// bucket)?

Thanks, G

G B asked Mar 16 '23 21:03



1 Answer

The files will be overwritten. There are several motivations for this:

  • The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be much more common than the use case where you produce data incrementally and add more of it to GCS with each execution of the pipeline.
  • It is good if rerunning a pipeline is idempotent (or close to it). E.g. if you find a bug in your pipeline, you can just fix it, rerun it, and get correct results that overwrite the old ones. A pipeline that appends to files would be very difficult to work with in this respect.
  • You are not required to specify the number of output shards for TextIO.Write; the number can differ slightly between executions, even for exactly the same pipeline and the same input data. The semantics of appending in that case would be very confusing.
  • As far as I know, appending is impossible to implement efficiently on any filesystem I'm aware of while preserving the atomicity and fault-tolerance guarantees (e.g. that you produce either all of the output or none of it, even in the face of bundle re-executions due to failures).
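
To make the idempotency point concrete, here is a minimal local-filesystem sketch in Python. It is not the Dataflow SDK and does not reflect how TextIO.Write is implemented; `write_shards`, `read_all`, and `rerun_twice` are hypothetical helpers, the round-robin sharding is an assumption made for illustration, and the `report-00000-of-00002` file names only imitate the usual sharded naming. It runs the same "pipeline" twice into the same location, once overwriting and once appending:

```python
import os
import tempfile


def write_shards(directory, prefix, records, num_shards, append=False):
    """Write records round-robin into num_shards files and return their paths.

    Hypothetical helper: a stand-in for a sharded text sink, not Dataflow code.
    """
    mode = "a" if append else "w"  # "w" truncates an existing file, "a" keeps it
    paths = []
    for shard in range(num_shards):
        name = f"{prefix}-{shard:05d}-of-{num_shards:05d}"
        path = os.path.join(directory, name)
        with open(path, mode) as f:
            for record in records[shard::num_shards]:
                f.write(record + "\n")
        paths.append(path)
    return paths


def read_all(paths):
    """Return every line from the given files, sorted for easy comparison."""
    lines = []
    for path in paths:
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return sorted(lines)


def rerun_twice(append):
    """Run the same 'pipeline' twice into one location and read the result."""
    records = ["a", "b", "c", "d"]
    with tempfile.TemporaryDirectory() as out_dir:
        write_shards(out_dir, "report", records, num_shards=2, append=append)
        paths = write_shards(out_dir, "report", records, num_shards=2, append=append)
        return read_all(paths)


if __name__ == "__main__":
    print(rerun_twice(append=False))  # each record appears once
    print(rerun_twice(append=True))   # each record appears twice
```

With overwrite semantics the second run leaves exactly the same output as the first; with append semantics every rerun duplicates the data, which is the behavior that would make fixing a bug and rerunning painful.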

This behavior will be documented in the next version of the SDK that appears on GitHub.

jkff answered May 13 '23 18:05
