
How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the Dataproc 1.1 image.

When running the project locally and building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When submitting the job to Dataproc, the Dataproc versions of these libraries take precedence, and the job ends up referencing class methods that cannot be resolved.

What is the recommended way to ensure that the correct versions of a dependency's transitive dependencies are loaded when submitting a Spark job on Dataproc? I'm not in a position to rewrite parts of this library to use the older versions of these packages that Dataproc provides.

Smith asked Oct 17 '22


1 Answer

The recommended approach is to include all of your job's dependencies in an uber jar (created with the Maven Shade plugin, for example) and to relocate the conflicting dependency classes inside that uber jar, so they cannot clash with the classes in the libraries provided by Dataproc.
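A minimal sketch of what that relocation might look like in a pom.xml; the plugin version and the `repackaged.*` shaded package prefix are illustrative choices, not required values:

```xml
<!-- Shade the job into an uber jar and relocate the conflicting
     Guava, Protobuf, and gRPC packages so the job's copies cannot
     collide with the versions on Dataproc's classpath. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>repackaged.com.google.common</shadedPattern>
          </relocation>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
          </relocation>
          <relocation>
            <pattern>io.grpc</pattern>
            <shadedPattern>repackaged.io.grpc</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, `mvn package` rewrites the bytecode of both your job and its bundled dependencies to reference the `repackaged.*` class names, so the JVM loads your bundled copies instead of Dataproc's.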

For reference, you can take a look at how this is done in the Cloud Storage connector, which is part of the Dataproc distribution.

Igor Dvorzhak answered Oct 21 '22