So, it is easy enough to handle external jars when using Hadoop straight up: the -libjars option does that for you. The question is how you do this with EMR; there must be an easy way. I thought the -cachefile option of the CLI would do it, but I couldn't get it working somehow. Any ideas, anyone?
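For reference, this is the kind of invocation I mean with plain Hadoop (jar and class names made up, and it assumes the driver goes through ToolRunner so that -libjars gets parsed):

    hadoop jar myjob.jar com.example.MyDriver -libjars deps/foo.jar,deps/bar.jar input/ output/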
Thanks for the help.
The best luck I have had with external jar dependencies is to copy them (via a bootstrap action) to /home/hadoop/lib throughout the cluster. That path is on the classpath of every host. This technique is the only one that has worked for me regardless of where the code that accesses the external jars lives (tool, job, or task).
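Here is a rough sketch of such a bootstrap action (the bucket and jar names are made up; it assumes the jars were uploaded to S3 first):

    #!/bin/bash
    # install-libs.sh -- copy external jars onto every node.
    # Bootstrap actions run on each node after Hadoop is installed but
    # before any steps run, so the jars end up on the classpath cluster-wide.
    hadoop fs -get s3n://mybucket/jars/foo.jar /home/hadoop/lib/
    hadoop fs -get s3n://mybucket/jars/bar.jar /home/hadoop/lib/

Then launch the jobflow with it, e.g. via the elastic-mapreduce CLI:

    elastic-mapreduce --create --bootstrap-action s3://mybucket/bootstrap/install-libs.sh ...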
One option is to have the first step in your jobflow set up the JARs wherever they need to be. Or, if they are dependencies, you can package them into your application JAR (which is probably in S3).
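For the packaging route, a minimal sketch (paths and names are made up):

    # Unpack the dependency jars into the compiled-classes tree, then re-jar
    # everything as a single "fat" application jar.
    mkdir fat && cp -r classes/* fat/
    (cd fat && jar xf ../lib/foo.jar && jar xf ../lib/bar.jar)
    jar cfe myjob-with-deps.jar com.example.MyDriver -C fat .

Hadoop also adds any jars nested under a lib/ directory inside the job jar to the task classpath, which is another way to bundle dependencies without unpacking them.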