
Use Docker for Google Cloud Dataflow dependencies

I am interested in using Google Cloud Dataflow to process videos in parallel. My job uses both OpenCV and TensorFlow. Is it possible to just run the workers inside a Docker container, rather than installing all the dependencies from source as described here:

https://cloud.google.com/dataflow/pipelines/dependencies-python

I would have expected a flag pointing to a Docker container that is already sitting in Google Container Engine.

asked Jun 21 '17 by bw4sz

1 Answer

2021 update

Dataflow now supports custom Docker containers. You can create your own container by following these instructions:

https://cloud.google.com/dataflow/docs/guides/using-custom-containers

The short answer is that Beam publishes container images on Docker Hub under apache/beam_${language}_sdk:${version}.

In your Dockerfile you would use one of them as the base image:

FROM apache/beam_python3.8_sdk:2.30.0
# Add your customizations and dependencies
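For the use case in the question (OpenCV plus TensorFlow), a minimal sketch of such a Dockerfile might look like the following. The choice of opencv-python-headless and the unpinned versions are assumptions for illustration, not something the answer prescribes:

# Sketch: extend the Beam Python SDK image with the question's dependencies.
FROM apache/beam_python3.8_sdk:2.30.0

# opencv-python-headless skips GUI libraries a Dataflow worker does not need;
# pin versions you have tested together in practice.
RUN pip install --no-cache-dir opencv-python-headless tensorflow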

Then you would upload this image to a container registry such as GCR or Docker Hub, and specify the following pipeline option: --worker_harness_container_image=$IMAGE_URI

And bing! You have a custom container.
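Putting the steps together, an end-to-end sketch might look like this. The image path, project ID, bucket, and my_pipeline.py are placeholders, and --experiments=use_runner_v2 reflects the requirement at the time that custom containers run on Dataflow Runner v2 (check the current docs for your SDK version):

# Build the image and push it to Google Container Registry (placeholder names).
export IMAGE_URI=gcr.io/my-project/beam-video-worker:latest
gcloud auth configure-docker
docker build -t $IMAGE_URI .
docker push $IMAGE_URI

# Run the pipeline on Dataflow with the custom worker image.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp \
  --experiments=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI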


Original answer

It is not possible to modify or switch the default Dataflow worker container. You need to install the dependencies according to the documentation linked in the question.
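At the time, that meant declaring dependencies through the pipeline's standard packaging options rather than through an image. A minimal sketch, assuming a requirements.txt that lists opencv-python and tensorflow (file names and project values are placeholders):

# Ship pip-installable dependencies to the workers at job submission time.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --temp_location=gs://my-bucket/tmp \
  --requirements_file=requirements.txt
# For packages with non-Python build steps, the docs instead recommend
# a setup.py passed via --setup_file=./setup.py.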

answered Oct 13 '22 by Pablo