TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)

Question

I cannot find any documentation on this, so I wonder: what is the behavior if the output files already exist (in a gs:// bucket)?

Thanks, G

jkff · Accepted Answer

The files will be overwritten. There are several motivations for this:

The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case where you are producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
It is good if rerunning a pipeline is idempotent(-ish?). E.g. if you find a bug in your pipeline, you can just fix it and rerun it, and enjoy the overwritten correct results. A pipeline that appends to files would be very difficult to work with in this respect.
It is not required to specify the number of output shards for TextIO.Write; it can slightly differ between different executions, even for exactly the same pipeline and the same input data. The semantics of appending in that case would be very confusing.
Appending is, as far as I know, impossible to implement efficiently on any filesystem while preserving the atomicity and fault tolerance guarantees (e.g. that you produce all output or none of it, even in the face of bundle re-executions due to failures).
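
To make the idempotency point concrete, here is a minimal sketch using only the Java standard library. It is not how Dataflow implements TextIO.Write; it assumes a local temp directory as a stand-in for a gs:// bucket, and the class and method names (OverwriteVsAppend, writeReplacing, writeAppending) are hypothetical. It runs the same "pipeline" twice and compares a replacing writer with an appending one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class OverwriteVsAppend {

    // Replace semantics: write everything to a temp file in the same
    // directory, then move it over the final name. A rerun leaves the
    // same content behind no matter what was there before.
    static void writeReplacing(Path output, List<String> lines) throws IOException {
        Path dir = output.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "tmp-", ".part");
        Files.write(temp, lines, StandardCharsets.UTF_8);
        Files.move(temp, output, StandardCopyOption.REPLACE_EXISTING);
    }

    // Append semantics: every run adds its lines to whatever already exists,
    // so a rerun (or a retried write) duplicates output.
    static void writeAppending(Path output, List<String> lines) throws IOException {
        Files.write(output, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Local stand-in for an output location such as a gs:// prefix (assumption).
        Path dir = Files.createTempDirectory("overwrite-demo");
        Path replaced = dir.resolve("report-replaced.txt");
        Path appended = dir.resolve("report-appended.txt");
        List<String> results = Arrays.asList("a,1", "b,2");

        // Simulate running the same pipeline twice.
        for (int run = 0; run < 2; run++) {
            writeReplacing(replaced, results);
            writeAppending(appended, results);
        }

        System.out.println(Files.readAllLines(replaced).size()); // prints 2
        System.out.println(Files.readAllLines(appended).size()); // prints 4
    }
}
```

After two runs the replaced file still holds exactly one copy of the results, while the appended file holds two. With appending, a retry after a partial failure would also leave a mix of old, partial, and new lines, which is the atomicity problem mentioned above.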

This behavior will be documented in the next version of the SDK published on GitHub.
