I'm attempting to use Jython for an implementation within Hadoop 1.2.1. I have seen strikingly little about Jython+Hadoop other than stale projects (like code.google.com/p/happy), and a stale implementation in $HADOOP_HOME/src/examples/python/WordCount.py, so perhaps I'm barking up the wrong tree to begin with... but this seems reasonable and possible. I am also very aware of Hadoop Streaming, with which I can use Python in Hadoop without using Jython, but that's not what I'm trying to do here.
Basically, when I invoke the embedded/standalone Jython jar file using java -jar /full/path/to/myjythonjar.jar, the /full/path/to/myjythonjar.jar/Lib is in my Python sys.path, but when I invoke using bin/hadoop jar /full/path/to/myjythonjar.jar input output the ...jar/Lib is not in my path, and the script can't find the Python modules I'm referencing.
Here's what I'm doing...
I'm using the standalone version of the Jython jar, and using the JarRunner interface, roughly as described on SO here and other places; essentially as follows:
cp jython-standalone-2.7-b1.jar jythonsalib_test.jar
jar ufe jythonsalib_test.jar org.python.util.JarRunner __run__.py
That is, take a copy of the standalone jar, add my script with name __run__.py, and change the Manifest to execute JarRunner -- many thanks to @Frank Wierzbicki for that gem.
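Before comparing the two launch modes, it can help to confirm that the repackaged jar really contains what JarRunner needs. The following is a minimal sketch, not part of the original setup: the helper name inspect_jython_jar is hypothetical, and it only assumes that the jar is an ordinary zip archive with a META-INF/MANIFEST.MF entry, a top-level __run__.py, and the Python modules under Lib/.

```python
import zipfile


def inspect_jython_jar(jar_path):
    """Summarize the parts of a repackaged Jython jar that JarRunner relies on.

    Hypothetical helper for illustration; raises KeyError if the jar
    has no META-INF/MANIFEST.MF entry.
    """
    jar = zipfile.ZipFile(jar_path)
    try:
        names = jar.namelist()
        manifest = jar.read('META-INF/MANIFEST.MF').decode('utf-8')
    finally:
        jar.close()

    # Find the entry point written by "jar ufe ... org.python.util.JarRunner".
    main_class = None
    for line in manifest.splitlines():
        if line.startswith('Main-Class:'):
            main_class = line.split(':', 1)[1].strip()

    return {
        'has_run_script': '__run__.py' in names,
        'main_class': main_class,
        'lib_entries': len([n for n in names if n.startswith('Lib/')]),
    }
```

Calling inspect_jython_jar('jythonsalib_test.jar') should report the JarRunner main class, the presence of __run__.py, and a non-zero count of Lib/ entries if the two commands above did what was intended.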
This all works fine when I'm running directly as, e.g.,
java -jar jythonsalib_test.jar
My sys.path reports that it includes '/full/path/to/jar/file/jythonsalib_test.jar/Lib', which is exactly what I expect, and it is the path from which I'm getting the Python modules (empirically tested by setting sys.path to an empty list (fails) and to ONLY that path (works)).
When I run this same jar in Hadoop, e.g., as
bin/hadoop jar /full/path/to/jar/file/jythonsalib_test.jar input output
sys.path only includes
['__classpath__', '__pyclasspath__']
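To make the difference between the two launch modes easy to see, a small diagnostic can be placed at the top of __run__.py. This is only a sketch: the function names classify_entry and describe_sys_path are made up for illustration, and the classification rules are simply the two kinds of entries observed above (the classpath markers and a Lib directory inside a jar).

```python
import sys


def classify_entry(entry):
    """Label one sys.path entry by what it appears to point at."""
    stripped = entry.rstrip('/')
    if stripped in ('__classpath__', '__pyclasspath__'):
        return 'classpath-marker'
    if '.jar' in stripped and stripped.endswith('/Lib'):
        return 'jar-lib'
    return 'other'


def describe_sys_path(paths=None):
    """Return (entry, label) pairs for the given list, or for sys.path."""
    if paths is None:
        paths = sys.path
    return [(p, classify_entry(p)) for p in paths]


if __name__ == '__main__':
    for entry, kind in describe_sys_path():
        print('%-18s %s' % (kind, entry))
```

Run with java -jar, this should list a jar-lib entry; run through bin/hadoop jar, only classpath-marker entries would be expected, matching the output shown above.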
I've also tried the Jython standalone jar versions 2.5.4-rc1 (which has the same behavior described above) and 2.5.3 (which doesn't work for me for unrelated reasons).
As pointed out in other SO answers, the workaround I'm currently using is to add the Lib directory of my jar to sys.path directly, inside the Jython script, like this:
import sys
sys.path.append('/full/path/to/jar/file/jythonsalib_test.jar/Lib')
And this basically works -- but this is meant to be a distributed application! There is no path that I can reference in this way. Other SO articles suggest various mechanisms, but are all basically adding to library paths (again, no links because I have <10rep) by Python like above, Java, or Jython installation or Jython "registry" (startup/rc) files. Sure, I could use HDFS or bootstrapping mechanisms or other mechanisms to distribute something to the compute nodes, like the jar or Jython or whatever, but the code is already in the jar! So I shouldn't need to distribute it again, separately...
So, in sum: It looks like I need to be on a filesystem that can directly, and separately, reference the jar file containing Python modules (akin to the old java -jar jythonjar.jar -jar jythonjar.jar). How do I convince an embedded, standalone Jython jar to always use the Python modules within the Lib subdirectory of the jar file, without separately pointing to (potentially the same) jar file?
Or: how do I add a relative path link to the current jar file...? Or am I missing something more insidious and fundamental about Hadoop or Jython or Java or...?
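One idea that might be worth testing for the relative-path question, offered as an untested assumption rather than a known fix: since __pyclasspath__ already appears in sys.path under Hadoop, and that entry is understood to resolve imports through the Java classpath, an entry such as '__pyclasspath__/Lib' may reach the jar's Lib directory without naming the jar file. The helper below (ensure_on_path is a hypothetical name) just adds an entry once; which entry actually works in this setup is the open question.

```python
import sys


def ensure_on_path(entry, paths=None):
    """Append entry to the path list unless already present.

    Returns True if the entry was added, False if it was already there.
    """
    if paths is None:
        paths = sys.path
    if entry in paths:
        return False
    paths.append(entry)
    return True


# Reported to work in the post, but tied to one machine's filesystem:
# ensure_on_path('/full/path/to/jar/file/jythonsalib_test.jar/Lib')

# Untested assumption: a classpath-relative entry that does not name the jar.
# ensure_on_path('__pyclasspath__/Lib')
```

Either call would need to run at the top of __run__.py, before the first import of a module that lives under Lib.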
I had a boatload more links, but SO tells me that I can only have TWO links because I'm new here. I hope some day to get enough rep to be able to truly contribute to this fantastic site! :)
Anyway. LTWFTW -- long time watcher, first time writer -- many thanks!
I wonder if packaging your app with OneJar would improve things. Please try and report back. I'm just shooting in the dark here.