I am interested in using Google Cloud Dataflow to process videos in parallel. My job uses both OpenCV and TensorFlow. Is it possible to just run the workers inside a Docker instance, rather than installing all the dependencies from source as described here:
https://cloud.google.com/dataflow/pipelines/dependencies-python
I would have expected a flag for a Docker container that is already sitting in Google Container Engine.
2021 update
Dataflow now supports custom docker containers. You can create your own container by following these instructions:
https://cloud.google.com/dataflow/docs/guides/using-custom-containers
The short answer is that Beam publishes SDK container images on Docker Hub under apache/beam_${language}_sdk:${version}.
In your Dockerfile, you use one of these images as the base:
FROM apache/beam_python3.8_sdk:2.30.0
# Add your customizations and dependencies
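For the use case in the question (OpenCV plus TensorFlow), a fleshed-out Dockerfile might look like the sketch below; the package choices (opencv-python-headless, tensorflow) are assumptions for illustration, not the only options:

FROM apache/beam_python3.8_sdk:2.30.0
# Install the question's dependencies on top of the Beam base image.
# opencv-python-headless avoids pulling in GUI libraries a worker does not need.
RUN pip install --no-cache-dir opencv-python-headless tensorflow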
Then you upload this image to a container registry such as GCR or Docker Hub, and specify the following pipeline option when launching the job: --worker_harness_container_image=$IMAGE_URI
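Concretely, the workflow could look like the sketch below. The project ID, image name, bucket, and pipeline file are placeholders; note that newer Beam SDK versions spell the option --sdk_container_image, and custom containers on Dataflow require Runner v2:

export PROJECT=my-gcp-project
export IMAGE_URI=gcr.io/$PROJECT/beam-video-worker:latest

# Build the custom image and push it to Container Registry
docker build -t $IMAGE_URI .
docker push $IMAGE_URI

# Launch the (hypothetical) pipeline on Dataflow with the custom image
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=$PROJECT \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp \
  --experiments=use_runner_v2 \
  --worker_harness_container_image=$IMAGE_URI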
And bingo! You have a custom container.
At the time this question was asked, it was not possible to modify or switch the default Dataflow worker container; the dependencies had to be installed as described in the documentation linked in the question.