
Writing Output of a Dataflow Pipeline to a Partitioned Destination

We have a single streaming event source with thousands of events per second; each event is marked with an ID identifying which of our tens of thousands of customers it belongs to. We'd like to use this event source to populate a data warehouse (in streaming mode). However, the event source is not persistent, so we'd also like to archive the raw data in GCS so we can replay it through our data-warehouse pipeline if we make a change that requires it. Because of data-retention requirements, any raw data we persist needs to be partitioned by customer so that we can easily delete it.

What would be the simplest way to solve this in Dataflow? Currently we're creating a Dataflow job with a custom sink that writes the data to per-customer files on GCS/BigQuery. Is that sensible?

Narek asked Jan 14 '16


1 Answer

To specify the filename and path, see the TextIO documentation; you provide the filename/path directly to the output writer.
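As a rough sketch of the TextIO side, assuming the Apache Beam Java SDK (into which the Dataflow SDK was later merged); the bucket name, suffix, and shard count here are hypothetical placeholders:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

public class ArchiveToGcs {
  // Writes already-serialized events to a GCS filename prefix.
  // "gs://my-archive-bucket" is a hypothetical bucket name.
  public static PDone archive(PCollection<String> rawEvents) {
    return rawEvents.apply("ArchiveToGcs",
        TextIO.write()
            .to("gs://my-archive-bucket/raw-events/part")  // filename prefix passed to the writer
            .withSuffix(".json")
            .withWindowedWrites()   // needed because the input is unbounded (streaming)
            .withNumShards(10));    // streaming writes require an explicit shard count
  }
}
```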

For your use case of multiple output files, you can use the Partition transform to create multiple PCollections out of a single source PCollection, then write each one to its own destination.
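A minimal sketch of that approach, again assuming the Beam Java SDK; the 4-way partition count, the bucket name, and the extractCustomerId() helper are hypothetical stand-ins. Note that Partition requires the partition count at graph-construction time, so with tens of thousands of customers you'd typically partition on a hash of the customer ID rather than one partition per customer:

```java
import java.util.List;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionByCustomer {
  private static final int NUM_PARTITIONS = 4;  // must be fixed when the pipeline graph is built

  public static void writePartitioned(PCollection<String> events) {
    // Route each event to a partition derived from its customer ID.
    PCollectionList<String> partitions = events.apply("PartitionByCustomer",
        Partition.of(NUM_PARTITIONS,
            (Partition.PartitionFn<String>) (event, numPartitions) ->
                Math.floorMod(extractCustomerId(event).hashCode(), numPartitions)));

    // Write each partition to its own GCS prefix so it can be deleted independently.
    List<PCollection<String>> all = partitions.getAll();
    for (int i = 0; i < all.size(); i++) {
      all.get(i).apply("WritePartition" + i,
          TextIO.write()
              .to("gs://my-archive-bucket/partition-" + i + "/events")  // hypothetical bucket
              .withWindowedWrites()
              .withNumShards(1));
    }
  }

  private static String extractCustomerId(String event) {
    // Hypothetical helper: assumes the record is serialized as "customerId,payload".
    return event.substring(0, event.indexOf(','));
  }
}
```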

Nick answered Oct 26 '22