We have a single streaming event source producing thousands of events per second; each event carries an id identifying which of our tens of thousands of customers it belongs to. We'd like to use this source to populate a data warehouse in streaming mode. However, the event source is not persistent, so we'd also like to archive the raw data in GCS so that we can replay it through the data warehouse pipeline if we make a change that requires it. Because of data retention requirements, any raw data that we persist needs to be partitioned by customer so that we can easily delete it.
What would be the simplest way to solve this in Dataflow? Currently we're creating a Dataflow job with a custom sink that writes per-customer files to GCS/BigQuery; is that sensible?
To specify the filename and path, see the TextIO documentation. You provide the filename/path prefix (and related settings) to the output writer.
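For example, here is a minimal sketch (assuming the Apache Beam Java SDK; the bucket path, prefix, and sample elements are placeholders) showing how the filename prefix, suffix, and shard count are passed to TextIO's writer:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

public class TextIoNamingExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    pipeline
        // Stand-in for your real source read.
        .apply(Create.of("event-1", "event-2", "event-3"))
        // to() sets the filename prefix (here a GCS path), withSuffix() the file
        // extension, and withNumShards() how many output files are produced.
        .apply(TextIO.write()
            .to("gs://my-archive-bucket/raw/events")   // placeholder bucket/prefix
            .withSuffix(".txt")
            .withNumShards(1));

    pipeline.run().waitUntilFinish();
  }
}
```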
For your use case of multiple output files, you can use the Partition transform to create multiple PCollections out of a single source PCollection.
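Below is a minimal sketch of that approach, again assuming the Apache Beam Java SDK. The bucket path, `NUM_PARTITIONS`, and `extractCustomerId()` are placeholders: Partition requires the number of output collections to be fixed when the pipeline is built, so with tens of thousands of customers you would likely hash customer ids into a fixed number of buckets rather than create one output per customer. For a real unbounded source you would also window the collection before the file writes, as described in the TextIO documentation.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionedArchiveExample {
  // Placeholder partition count; choose to suit your retention/deletion layout.
  static final int NUM_PARTITIONS = 4;

  // Placeholder parser: assumes the customer id is the first comma-separated field.
  static String extractCustomerId(String event) {
    return event.split(",", 2)[0];
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Stand-in for your real streaming source read.
    PCollection<String> events =
        pipeline.apply(Create.of("cust-1,login", "cust-2,purchase", "cust-1,logout"));

    // Partition splits one PCollection into a fixed number of PCollections,
    // routing each element according to the supplied function.
    PCollectionList<String> byCustomer = events.apply(
        Partition.of(NUM_PARTITIONS,
            (Partition.PartitionFn<String>) (event, numPartitions) ->
                Math.floorMod(extractCustomerId(event).hashCode(), numPartitions)));

    // Write each partition under its own GCS prefix so it can be deleted independently.
    for (int i = 0; i < NUM_PARTITIONS; i++) {
      byCustomer.get(i).apply("Write" + i,
          TextIO.write().to("gs://my-archive-bucket/customer-partition-" + i + "/events"));
    }

    pipeline.run().waitUntilFinish();
  }
}
```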