
Writing Output of a Dataflow Pipeline to a Partitioned Destination

We have a single streaming event source with thousands of events per second; each event is marked with an ID identifying which of our tens of thousands of customers it belongs to. We'd like to use this event source to populate a data warehouse (in streaming mode). However, the event source is not persistent, so we'd also like to archive the raw data in GCS so we can replay it through our data-warehouse pipeline if we make a change that requires it. Because of data-retention requirements, any raw data we persist needs to be partitioned by customer so that we can easily delete it.

What would be the simplest way to solve this in Dataflow? Currently we're creating a Dataflow job with a custom sink that writes the data to per-customer files on GCS/BigQuery. Is that sensible?

Narek asked Jan 14 '16


1 Answer

To specify the filename and path, see the TextIO documentation; you provide the filename/path directly to the output writer.
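As a rough sketch of the TextIO side, assuming the Apache Beam Java SDK (into which the Dataflow SDK was later merged); the bucket name, suffix, and shard count here are hypothetical placeholders:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

public class ArchiveToGcs {
  // Writes already-serialized events to a GCS filename prefix.
  // "gs://my-archive-bucket" is a hypothetical bucket name.
  public static PDone archive(PCollection<String> rawEvents) {
    return rawEvents.apply("ArchiveToGcs",
        TextIO.write()
            .to("gs://my-archive-bucket/raw-events/part")  // filename prefix passed to the writer
            .withSuffix(".json")
            .withWindowedWrites()   // needed because the input is unbounded (streaming)
            .withNumShards(10));    // streaming writes require an explicit shard count
  }
}
```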

For your use case of multiple output files, you can use the Partition transform to create multiple PCollections out of a single source PCollection, then write each one to its own destination.
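A minimal sketch of that approach, again assuming the Beam Java SDK; the 4-way partition count, the bucket name, and the extractCustomerId() helper are hypothetical stand-ins. Note that Partition requires the partition count at graph-construction time, so with tens of thousands of customers you'd typically partition on a hash of the customer ID rather than one partition per customer:

```java
import java.util.List;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionByCustomer {
  private static final int NUM_PARTITIONS = 4;  // must be fixed when the pipeline graph is built

  public static void writePartitioned(PCollection<String> events) {
    // Route each event to a partition derived from its customer ID.
    PCollectionList<String> partitions = events.apply("PartitionByCustomer",
        Partition.of(NUM_PARTITIONS,
            (Partition.PartitionFn<String>) (event, numPartitions) ->
                Math.floorMod(extractCustomerId(event).hashCode(), numPartitions)));

    // Write each partition to its own GCS prefix so it can be deleted independently.
    List<PCollection<String>> all = partitions.getAll();
    for (int i = 0; i < all.size(); i++) {
      all.get(i).apply("WritePartition" + i,
          TextIO.write()
              .to("gs://my-archive-bucket/partition-" + i + "/events")  // hypothetical bucket
              .withWindowedWrites()
              .withNumShards(1));
    }
  }

  private static String extractCustomerId(String event) {
    // Hypothetical helper: assumes the record is serialized as "customerId,payload".
    return event.substring(0, event.indexOf(','));
  }
}
```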

Nick answered Oct 26 '22