I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run on the cluster it appears Pig's generated Hadoop job is unable to find the imported modules. What needs to be done?
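A minimal sketch of the kind of UDF in question (the file name, function name, and output schema here are made up for illustration):

```python
# myudfs.py -- a Jython UDF that imports a standard-library module
import re  # imports fine in local mode, but may fail on the cluster

@outputSchema('word:chararray')  # decorator provided by Pig's Jython script engine
def normalize(s):
    if s is None:
        return None
    return re.sub(r'\W+', '_', s).lower()
```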
For example:
Does python (or jython) need to be installed on each task tracker node?
Yes, since the UDF code is executed on the task tracker nodes.
Do the python (or jython) modules need to be installed on each task tracker node?
If you are using a third-party module (geoip, for example), it needs to be installed on the task tracker nodes as well.
Do the task tracker nodes need to know how to find the modules? If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?
Quoting the book "Programming Pig":
register is also used to locate resources for Python UDFs that you use in your Pig Latin scripts. In this case you do not register a jar, but rather a Python script that contains your UDF. The Python script must be in your current directory.
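For instance, with the UDF file sitting in the same directory as the Pig script, the registration might look like this (file, namespace, and field names are hypothetical):

```pig
-- register the Python script (it must be in the current directory)
register 'myudfs.py' using jython as myudfs;

records = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH records GENERATE myudfs.normalize(line);
DUMP words;
```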
And this one is also important:
A caveat, Pig does not trace dependencies inside your Python scripts and send the needed Python modules to your Hadoop cluster. You are required to make sure the modules you need reside on the task nodes in your cluster and that the PYTHONPATH environment variable is set on those nodes such that your UDFs will be able to find them for import. This issue has been fixed after 0.9, but as of this writing not yet released.
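Until you are on a release with that fix, this means doing something like the following on every task tracker node (the paths are hypothetical, and exactly where you export PYTHONPATH so the task JVMs see it depends on your Hadoop setup, e.g. hadoop-env.sh):

```sh
# copy or install the modules your UDFs import onto each task node
sudo cp -r my_module/ /opt/pylib/

# make sure the task trackers' environment can find them
export PYTHONPATH=/opt/pylib:$PYTHONPATH
```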
And if you are using Jython:
Pig does not know where on your system the Jython interpreter is, so you must include jython.jar in your classpath when invoking Pig. This can be done by setting the PIG_CLASSPATH environment variable.
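For example (the path to jython.jar depends on where Jython is installed on your machine):

```sh
export PIG_CLASSPATH=/usr/local/jython/jython.jar:$PIG_CLASSPATH
pig myscript.pig
```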
To summarize: if you are using streaming, you can use the SHIP command in Pig, which sends your executable files to the cluster (a sketch follows below). If you are using a UDF, then as long as it compiles (see the note about Jython above) and has no third-party dependencies (beyond what you have already put on the PYTHONPATH or installed on the cluster), the UDF is shipped to the cluster when the script is executed. (As a tip, it makes your life much easier to keep your simple UDF dependencies in the same folder as the Pig script when registering.)
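A rough sketch of the streaming/SHIP alternative mentioned above (script and field names are hypothetical; the shipped script must be executable and read from stdin / write to stdout):

```pig
DEFINE my_stream `normalize.py` SHIP('normalize.py');

raw = LOAD 'input.txt' AS (line:chararray);
out = STREAM raw THROUGH my_stream;
DUMP out;
```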
Hope this clears things up.