How to run Google Cloud Dataflow job from App Engine?

After reading the Cloud Dataflow docs, I am still not sure how I can run my Dataflow job from App Engine. Is it possible? Does it matter whether my backend is written in Python or in Java? Thanks!

asked Apr 14 '15 by deemson


People also ask

Can we trigger a Dataflow job with a Google Cloud Function?

So the trigger for the Cloud Function is Google Cloud Storage's Finalise/Create event, i.e., when a file is uploaded to a GCS bucket, the Cloud Function must trigger the Cloud Dataflow pipeline. When I create a Dataflow pipeline (batch) and execute it, it creates a Dataflow pipeline template and a Dataflow job.

How do you run a Dataflow pipeline?

If you are looking for a step-by-step guide on how to create and deploy your first pipeline, use Dataflow's quickstarts for Java, Python, Go, or templates. After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it.


2 Answers

Yes, it is possible; you need to use the "Streaming execution" as mentioned here.

Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.

From App Engine you can perform the "Pub" action, publishing messages to the Pub/Sub topic with the REST API.
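
For illustration, here is a minimal sketch of that "Pub" step from a Java App Engine backend, calling the Pub/Sub v1 REST endpoint directly. The project ID and topic name are placeholders, and it assumes the App Engine App Identity API is used to obtain the OAuth token:

    import com.google.appengine.api.appidentity.AppIdentityService;
    import com.google.appengine.api.appidentity.AppIdentityServiceFactory;

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.Collections;

    public class PubSubPublisher {

        // Placeholder project and topic names -- replace with your own.
        private static final String PROJECT = "my-gcp-project";
        private static final String TOPIC = "dataflow-trigger";

        /** Publishes one message to the topic via the Pub/Sub v1 REST API. */
        public static void publish(String payload) throws Exception {
            // Get an OAuth2 access token for the Pub/Sub scope from the App Identity API.
            AppIdentityService identity = AppIdentityServiceFactory.getAppIdentityService();
            String token = identity
                .getAccessToken(Collections.singletonList("https://www.googleapis.com/auth/pubsub"))
                .getAccessToken();

            // Pub/Sub expects the message data to be base64-encoded.
            String data = Base64.getEncoder()
                .encodeToString(payload.getBytes(StandardCharsets.UTF_8));
            String body = "{\"messages\":[{\"data\":\"" + data + "\"}]}";

            URL url = new URL("https://pubsub.googleapis.com/v1/projects/"
                + PROJECT + "/topics/" + TOPIC + ":publish");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Bearer " + token);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            if (conn.getResponseCode() != 200) {
                throw new RuntimeException("Publish failed: HTTP " + conn.getResponseCode());
            }
        }
    }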

answered Sep 30 '22 by aqquadro


One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.

A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (including the Dataflow SDK in your GAE project) and set the job options programmatically, as explained here:

https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Make sure to set the 'runner' option to DataflowPipelineRunner so that the job executes asynchronously on Google Cloud Platform. Since the pipeline runner (which actually runs your pipeline) does not have to run in the same place as the code that initiates it, this code (up to and including pipeline.run()) can live in App Engine.
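
As a rough sketch (not the only way to structure it), the programmatic setup could look like this with the Dataflow SDK for Java; the project ID, bucket paths, and the trivial TextIO pipeline are placeholders:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

    public class PipelineLauncher {

        /** Builds the job options in code and hands the pipeline off to the Dataflow service. */
        public static void launch() {
            DataflowPipelineOptions options =
                PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
            // DataflowPipelineRunner submits the job and returns without waiting for it
            // to finish, so the App Engine request is not blocked while the job runs.
            options.setRunner(DataflowPipelineRunner.class);
            options.setProject("my-gcp-project");                  // placeholder project ID
            options.setStagingLocation("gs://my-bucket/staging");  // placeholder GCS bucket

            Pipeline p = Pipeline.create(options);
            // Trivial placeholder pipeline: copy text files from one GCS location to another.
            p.apply(TextIO.Read.from("gs://my-bucket/input/*"))
             .apply(TextIO.Write.to("gs://my-bucket/output/result"));

            // Submits the job to the Cloud Dataflow service; execution happens off App Engine.
            p.run();
        }
    }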

You can then add an endpoint or servlet to GAE that, when called, runs the code that sets up the pipeline.
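
A minimal servlet along those lines might look like the following, reusing the hypothetical PipelineLauncher from the sketch above; the URL mapping is an assumption:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Mapped to e.g. /launch-pipeline in web.xml; hitting that URL submits the job.
    public class LaunchPipelineServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Reuses the hypothetical PipelineLauncher from the sketch above.
            PipelineLauncher.launch();
            resp.setContentType("text/plain");
            resp.getWriter().println("Dataflow job submitted");
        }
    }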

To take the scheduling one step further, you could have a cron job in GAE that calls the endpoint that initiates the pipeline...
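
For example, assuming the servlet above is mapped to /launch-pipeline, a cron.xml entry for a Java GAE app could look roughly like this (the schedule is just an example):

    <?xml version="1.0" encoding="UTF-8"?>
    <cronentries>
        <cron>
            <url>/launch-pipeline</url>
            <description>Submit the Dataflow pipeline once a day</description>
            <schedule>every 24 hours</schedule>
        </cron>
    </cronentries>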

answered Sep 30 '22 by GavinG