
JNI in Dataflow

I need to use JNI in my Dataflow pipeline. The JNI code uses a C++ library that has a ton of external dependencies on other system libraries. What is the best way to make sure those libraries are where they should be in the operating system when a worker runs the DoFn that uses the C++ library?

I found that DataflowPipelineOptions.setWorkerHarnessContainerImage might let me specify a custom Docker image from the Google Container Registry, on which I could install the libraries I need, but the documentation doesn't say much more. Are there any requirements for the Docker image in terms of installed packages, entry points, etc.?

stepanbujnak asked Oct 16 '22 23:10



1 Answer

Apache Beam recently published an example of calling sub-processes from a Dataflow worker. The solution downloads the binary dynamically within the DoFn's @Setup method and then executes the binary for each record processed by the pipeline. It also collects the output from the process and propagates failures to the pipeline.
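To give a rough idea of the per-record part, here is a minimal sketch in plain Java. It is not the code from the Beam example: SubprocessRunner and its run method are hypothetical names made up for illustration, and the sketch assumes the binary has already been staged on the worker (for example in @Setup) and takes the record as a command-line argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper (not part of Apache Beam): runs one external command per call. */
final class SubprocessRunner {
    private SubprocessRunner() {}

    /**
     * Runs the command, returns its combined stdout/stderr, and throws an
     * unchecked exception on a non-zero exit code so the caller fails too.
     */
    static String run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // This sketch sends nothing on stdin; close it so the child sees EOF.
            process.getOutputStream().close();
            // Read everything the process prints until it closes its output.
            String output = new String(
                    process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException(
                        command + " exited with code " + exitCode + ": " + output);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException("Could not run " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        }
    }
}
```

Inside a DoFn you would presumably call something like SubprocessRunner.run(List.of(binaryPath, record)) from the @ProcessElement method, where binaryPath is whatever location @Setup staged the binary to. Because failures surface as runtime exceptions, a non-zero exit code fails the element instead of being silently dropped. The sketch has no timeout and buffers the whole output in memory, so it only suits short-lived processes with small output.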

Ryan McDowell answered Oct 21 '22 04:10
