I need to use JNI in my Dataflow pipeline. The JNI code uses a C++ library that has many external dependencies on other system libraries. What is the best way to make sure those libraries are present on the worker's operating system when it runs the DoFn that uses the C++ library?
I found that DataflowPipelineOptions.setWorkerHarnessContainerImage might let me specify a custom Docker image from the Google Container Registry onto which I could install the libraries, but the documentation doesn't say much more. Are there any requirements for the Docker image in terms of installed packages, entry points, etc.?
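For reference, a common approach with custom worker images is to base them on an official Apache Beam SDK image, so the worker harness entrypoint stays intact, and layer the needed system packages on top. A minimal sketch (the SDK version, package names, and library filename below are all illustrative assumptions, not from the original question):

```dockerfile
# Assumption: base on an official Beam Java SDK image so the worker
# harness entrypoint and expected directory layout are preserved.
FROM apache/beam_java11_sdk:2.46.0

# Install the system libraries the C++ dependency needs
# (libgomp1 here is just an example package).
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Place the JNI shared library where the dynamic loader can find it
# (libmynative.so is a hypothetical name).
COPY libmynative.so /usr/lib/
```

The image is then pushed to a registry the Dataflow workers can pull from and referenced via the pipeline option.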
Apache Beam recently published an example of calling sub-processes from a Dataflow worker. The solution downloads the binary dynamically in the DoFn's @Setup method and then executes it for each record the pipeline processes. It also collects the output from the process and propagates failures to the pipeline.
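The core of that pattern, stripped of the Beam-specific scaffolding, can be sketched in plain Java: stage the binary once per worker (the @Setup step), shell out per record, capture stdout, and surface a non-zero exit code as an exception so the pipeline sees the failure. The class and method names below are illustrative, not the actual Beam example's API; `/bin/echo` stands in for the downloaded binary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

public class SubProcessSketch {

    // Stand-in for the DoFn's @Setup step: a real pipeline would download
    // the binary (e.g. from GCS) to local disk and mark it executable.
    static final String BINARY_PATH = "/bin/echo"; // hypothetical binary

    // Stand-in for @ProcessElement: run the binary for one record and
    // return its stdout.
    static String processRecord(String record)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(BINARY_PATH, record)
                .redirectErrorStream(true)
                .start();
        String out;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            out = r.lines().collect(Collectors.joining("\n"));
        }
        if (p.waitFor() != 0) {
            // Propagating the failure lets the runner retry or fail the bundle.
            throw new IOException("binary exited with code " + p.exitValue());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processRecord("hello"));
    }
}
```

Running it prints the record echoed back, confirming the subprocess output was captured.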