
Custom VM images for Google Cloud Dataflow workers

Having skimmed the Google Cloud Dataflow documentation, my impression is that worker VMs run a specific predefined Python 2.7 environment with no option to change it. Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? And is it possible to run Python 3 on Google Cloud Dataflow?

asked Feb 14 '18 by sandris

People also ask

What is difference between Dataproc and dataflow?

Dataproc should be used if the processing has dependencies on tools in the Hadoop ecosystem. Dataflow/Beam provides a clear separation between the processing logic and the underlying execution engine.


1 Answer

2021 Update

As of today, the answer to both of these questions is YES.

  1. Python 3 is supported on Dataflow.
  2. Custom container images are supported on Dataflow; see this SO answer and this docs page.
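As a rough sketch of what "custom container images" means in practice: rather than a custom VM image, you extend one of the official Beam SDK images with your own libraries and system packages. The base image tag, package names, and file names below are illustrative assumptions, not taken from the linked docs page.

```dockerfile
# Sketch of a custom worker container, assuming a published Beam SDK base
# image (check Docker Hub for the tag matching your Beam/Python versions).
FROM apache/beam_python3.9_sdk:2.35.0

# Install an external command the pipeline shells out to (example: ffmpeg).
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Install the pipeline's Python dependencies (hypothetical requirements.txt).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

You would then build and push this image to a registry your project can read and point the job at it with the SDK's container-image pipeline option (named `--sdk_container_image` in recent Beam Python releases; verify the exact flag for your SDK version).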

Original answer (2018), quoting the question:

Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? Is it possible to run Python 3 on Google Cloud Dataflow?

No to both questions. You can configure the Compute Engine machine type and disk size for a Dataflow job, but you cannot configure things such as installed applications. At the time this answer was written, Apache Beam did not support Python 3.x.
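For context, the machine-type and disk-size configuration mentioned above is done through pipeline options at launch time. The flag names below follow the Beam Python SDK's worker options, but the project, bucket, and file names are placeholders; check the execution-parameters page in the references for the options your SDK version accepts.

```shell
# Sketch only: launching a pipeline with worker VM settings.
# my_pipeline.py, my-gcp-project, and the bucket are hypothetical.
python my_pipeline.py \
  --runner DataflowRunner \
  --project my-gcp-project \
  --region us-central1 \
  --temp_location gs://my-bucket/tmp \
  --machine_type n1-standard-4 \
  --disk_size_gb 50
```

Note that these options only size the stock worker VMs; they do not change what software is installed on them.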

References:

  1. https://cloud.google.com/dataflow/pipelines/specifying-exec-params
  2. https://issues.apache.org/jira/browse/BEAM-1251
  3. https://beam.apache.org/get-started/quickstart-py/
answered Oct 08 '22 by Andrew Nguonly