Having skimmed the Google Cloud Dataflow documentation, my impression is that worker VMs run a specific predefined Python 2.7 environment without any option to change that. Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? Is it possible to run Python 3 on Google Cloud Dataflow?
Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem. Dataflow/Beam provides a clear separation between the processing logic and the underlying execution engine.
2021 Update
As of today, the answer to both of these questions is YES.
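A minimal sketch of what this looks like in practice, assuming placeholder project, bucket, and image names, and assuming a custom worker image has already been built from an Apache Beam Python 3 SDK base image and pushed to a registry. The custom image is passed to Dataflow via the sdk_container_image pipeline option (supported by recent Beam SDKs with Dataflow Runner v2):

```python
# Sketch only: project, region, bucket, and image URI below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # assumption: placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # assumption: placeholder bucket
    # Custom worker container with extra libraries / external commands baked in.
    sdk_container_image="gcr.io/my-project/my-beam-worker:latest",
)

# A trivial Python 3 pipeline running on the custom worker image.
with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "dataflow"])
     | "Upper" >> beam.Map(str.upper)
     | "Print" >> beam.Map(print))
```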
Original answer:

Is it possible to provide a custom VM image for the workers (built with the libraries and external commands that the particular application needs)? Is it possible to run Python 3 on Google Cloud Dataflow?

No and no to both questions. You can configure the Compute Engine machine type and disk size for a Dataflow job, but you cannot configure things like the installed applications. At the time of this original answer, Apache Beam did not support Python 3.x.
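The worker tuning mentioned above (machine type and disk size) is exposed as pipeline options in the Python SDK. A minimal sketch, again with placeholder project and bucket names:

```python
# Sketch only: project, region, and bucket below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # assumption: placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # assumption: placeholder bucket
    machine_type="n1-standard-4",         # Compute Engine machine type for workers
    disk_size_gb=100,                     # per-worker persistent disk size
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2) | beam.Map(print)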