easiest way to schedule a Google Cloud Dataflow job

I just need to run a Dataflow pipeline on a daily basis, but the suggested solutions, like the App Engine Cron Service, which requires building a whole web app, seem like a bit too much. I was thinking about just running the pipeline from a cron job on a Compute Engine Linux VM, but maybe that's far too simple :). What's the problem with doing it that way, and why isn't anybody (besides me, I guess) suggesting it?

asked May 06 '17 by CCC

People also ask

How do I schedule a Dataflow job in Google Cloud?

Go to the Dataflow Pipelines page in the Google Cloud console, then select +Create data pipeline. On the Create pipeline from template page, provide a pipeline name, and fill in the other template selection and parameter fields. For a batch job, in the Schedule your pipeline section, provide a recurrence schedule.


1 Answer

This is how I did it using Cloud Functions, PubSub, and Cloud Scheduler (this assumes you've already created a Dataflow template and that it exists somewhere in a GCS bucket).

  1. Create a new topic in PubSub. This will be used to trigger the Cloud Function.
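If you prefer to create the topic from code instead of the console, here's a minimal sketch using the @google-cloud/pubsub client (the topic name 'dataflow-trigger' is just a placeholder, and this assumes a recent release of the library rather than the 0.18 version pinned in the package.json below):

const { PubSub } = require('@google-cloud/pubsub');

async function createTriggerTopic() {
  const pubsub = new PubSub();
  // 'dataflow-trigger' is a placeholder name -- use whatever you plan
  // to wire the Cloud Function's trigger to.
  const [topic] = await pubsub.createTopic('dataflow-trigger');
  console.log(`Created topic ${topic.name}`);
}

createTriggerTopic().catch(console.error);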

  2. Create a Cloud Function that launches a Dataflow job from a template. I find it easiest to create this from the Cloud Functions console. Make sure the service account you choose has permission to create a Dataflow job. The function's index.js looks something like:

const { google } = require('googleapis');

exports.triggerTemplate = (event, context) => {
  // In this case the PubSub message payload and attributes are not used,
  // but they could be used to pass parameters needed by the Dataflow template.
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  console.log(event.attributes);

  // Authenticate as the function's service account using
  // Application Default Credentials.
  google.auth.getApplicationDefault(function (err, authClient, projectId) {
    if (err) {
      console.error('Error occurred: ' + err.toString());
      throw new Error(err);
    }

    const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });

    // Launch a Dataflow job from the template stored in GCS.
    dataflow.projects.templates.create({
      projectId: projectId,
      resource: {
        parameters: {},
        jobName: 'SOME-DATAFLOW-JOB-NAME',
        gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
      }
    }, function (err, response) {
      if (err) {
        console.error('Problem running dataflow template, error was: ', err);
        return;
      }
      console.log('Dataflow template response: ', response);
    });
  });
};

The package.json looks like:

{
  "name": "pubsub-trigger-template",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "37.1.0",
    "@google-cloud/pubsub": "^0.18.0"
  }
}
  3. Go to PubSub, open the topic you created, and manually publish a message. This should trigger the Cloud Function and start a Dataflow job. You can publish from the console, or from code as sketched below.
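For the manual test, a short script like this works too (again a sketch, assuming a recent @google-cloud/pubsub and the placeholder topic name from step 1):

const { PubSub } = require('@google-cloud/pubsub');

async function publishTestMessage() {
  const pubsub = new PubSub();
  // The payload is ignored by the function above, but the message
  // needs some data, so send a small placeholder buffer.
  const messageId = await pubsub
    .topic('dataflow-trigger')
    .publishMessage({ data: Buffer.from('test run') });
  console.log(`Published message ${messageId}`);
}

publishTestMessage().catch(console.error);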

  4. Use Cloud Scheduler to publish a PubSub message on a schedule: https://cloud.google.com/scheduler/docs/tut-pub-sub
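If you'd rather set up the Scheduler job programmatically than through the console, a sketch with the @google-cloud/scheduler client could look like the following (the project ID, region, topic name, and cron schedule are all placeholders):

const { CloudSchedulerClient } = require('@google-cloud/scheduler');

async function scheduleDailyTrigger() {
  const client = new CloudSchedulerClient();
  // Placeholders: substitute your own project and region.
  const parent = client.locationPath('YOUR-PROJECT-ID', 'us-central1');
  const [job] = await client.createJob({
    parent: parent,
    job: {
      // Publish to the trigger topic created in step 1.
      pubsubTarget: {
        topicName: 'projects/YOUR-PROJECT-ID/topics/dataflow-trigger',
        data: Buffer.from('scheduled run'),
      },
      // Run once a day at 06:00 (standard cron syntax).
      schedule: '0 6 * * *',
      timeZone: 'Etc/UTC',
    },
  });
  console.log(`Created scheduler job ${job.name}`);
}

scheduleDailyTrigger().catch(console.error);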

answered Sep 20 '22 by twang