
Can Google Cloud Dataprep monitor a GCS path for new files?

Google Cloud Dataprep seems great and we've used it to manually import static datasets. However, I would like to run it more than once so that it can consume new files uploaded to a GCS path. I can see that you can set up a schedule for Dataprep, but I cannot see anywhere in the import setup how it would process new files.

Is this possible? Seems like an obvious need - hopefully I've missed something obvious.

Matt Byrne asked Nov 29 '17 06:11



2 Answers

A further update on this: since I asked the question, the Dataprep release of Jan 23 2018 added the ability to re-run Dataflow jobs independently of Dataprep.

When you execute a Dataprep job, it generates a Dataflow template that you can use to trigger jobs manually in the future, and the template accepts certain parameters.

Steps to trigger a job on new files (please note this is Beta, so Google may change the exact process):

  1. Create your flow and run the relevant flow/recipe. Iterate manually until the recipe is how you want it. When you are happy, run the job again (it should be a job that appends data rather than replaces it, since you likely want to append new content). It's probably a good idea to uncheck "Profile results" (new feature) to reduce overhead, since this will be a repeatable job.
  2. Once the job is complete, go to the Job details page and click the Export Results button; there you should see a link to the Dataflow template. Copy the text. Note that the Dataflow template path will only be available for jobs executed after the Jan 23 2018 release, since it was a new feature.
  3. You can then see how to trigger a Dataflow job by going to Dataflow and selecting CREATE JOB FROM TEMPLATE, selecting Custom template and pasting in your template path. There you will see the parameters you can supply, such as your GCS input path.
  4. Write a Google Cloud Function that is triggered by a GCS write and, using the details of the event, executes the template with your file path as per step (3) above.
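
As a rough illustration of step (4), here is a minimal sketch of a Python background Cloud Function that launches the template through the Dataflow `projects.templates.launch` API using the `google-api-python-client` library. `PROJECT_ID`, `TEMPLATE_PATH` and the helper name `build_launch_body` are placeholders of mine, and the `inputLocations` / `location1` parameter shape is an assumption: use whatever parameter names the CREATE JOB FROM TEMPLATE page shows for your own template.

```python
import json
import re

# Hypothetical values: replace with your own project and the template path
# copied from the Dataprep "Export Results" dialog (step 2).
PROJECT_ID = "my-project"
TEMPLATE_PATH = "gs://my-bucket/path/to/dataprep/template"


def build_launch_body(bucket, name):
    """Build the request body for a Dataflow templates.launch call."""
    # Reduce the object name to lowercase letters, digits and hyphens
    # so that it is safe to use as part of a Dataflow job name.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-") or "file"
    return {
        "jobName": "dataprep-" + slug,
        "parameters": {
            # Assumed parameter name and shape; check the parameters listed
            # for your template on the CREATE JOB FROM TEMPLATE page (step 3).
            "inputLocations": json.dumps(
                {"location1": "gs://{}/{}".format(bucket, name)}
            ),
        },
    }


def on_gcs_write(event, context):
    """Entry point for a Cloud Function triggered by a GCS object write."""
    # Imported lazily so the helper above works without the client library.
    from googleapiclient.discovery import build

    body = build_launch_body(event["bucket"], event["name"])
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    return dataflow.projects().templates().launch(
        projectId=PROJECT_ID, gcsPath=TEMPLATE_PATH, body=body
    ).execute()
```

Keeping the request body in a separate function makes it easy to check the file path and job name locally before deploying. Your template may also require other parameters (for example an output location), so treat the body above as a starting point only.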
Matt Byrne answered Oct 23 '22 06:10



You can add a GCS path as a dataset by clicking the + icon to the left of the folder while selecting the dataset (see screenshot). When you set up a scheduled job for a flow that uses this dataset, all files in that directory (including new files) will be picked up on each scheduled job run.

[Screenshot: the + icon to the left of a GCS folder during dataset selection]

Lars Grammel answered Oct 23 '22 08:10
