Google Cloud Dataprep seems great and we've used it to manually import static datasets. However, I would like to execute it more than once so that it can consume new files uploaded to a GCS path. I can see that you can set up a schedule for Dataprep, but I cannot see anywhere in the import setup how it would process new files.
Is this possible? Seems like an obvious need - hopefully I've missed something obvious.
A further update on this. Since my question, a new release of Dataprep on Jan 23 2018 includes the ability to re-run Dataflow jobs independently of Dataprep.
When you execute a Dataprep job it will generate a Dataflow template that you can use to trigger jobs manually in the future and it allows certain parameters to be passed in.
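To illustrate what re-running from the template could look like, here is a rough Python sketch that launches a Dataflow template through the Dataflow REST API (`projects.templates.launch`) using the `google-api-python-client` package. The parameter names `inputLocations` and `outputLocations`, the `location1` keys, and all bucket, project and job names are assumptions for illustration; check the metadata file stored alongside your generated template for the parameters it really accepts.

```python
import json


def build_launch_body(job_name, input_locations, output_locations):
    """Build the request body for the Dataflow templates.launch method."""
    return {
        "jobName": job_name,
        "parameters": {
            # Assumed parameter names; template parameters must be strings,
            # so the location maps are JSON-encoded here.
            "inputLocations": json.dumps(input_locations),
            "outputLocations": json.dumps(output_locations),
        },
    }


def launch_template(project_id, template_gcs_path, body):
    """Launch the template; needs google-api-python-client and credentials."""
    # Imported lazily so that building the body needs no extra packages.
    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId=project_id, gcsPath=template_gcs_path, body=body
    )
    return request.execute()


# Hypothetical values, replace with your own paths.
body = build_launch_body(
    "dataprep-rerun-1",
    {"location1": "gs://example-bucket/incoming/new-file.csv"},
    {"location1": "gs://example-bucket/output/result.csv"},
)
```

Calling `launch_template("my-project", "gs://.../template-path", body)` would then submit the job. I kept the body construction separate from the API call so the parameters can be inspected before anything is sent.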
Steps to be able to trigger on new files (please note this is Beta, so Google may change the exact process):
You can add a GCS path as a dataset by clicking on the + icon to the left of the folder when selecting the dataset (see screenshot). When you set up a scheduled job for a flow that uses this dataset, all files in that directory (including new files) will be picked up on each scheduled job run.