How to import a custom module in a MapReduce job?

Tags: python, mapreduce, hadoop-streaming

I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output

As I understand it, this should put both main.py and lib.py into the distributed cache folder on each compute node and thus make the lib module available to main. But that doesn't happen: the log shows that the files really are copied to the same directory, yet main can't import lib and throws ImportError.

Why does this happen and how can I fix it?

UPD. Adding the current directory to the path didn't work:

import os
import sys
sys.path.append(os.path.realpath(__file__))
import lib
# ImportError

Loading the module manually did the trick, though:

import imp
lib = imp.load_source('lib', 'lib.py')

But that's not what I want. So why can the Python interpreter see other .py files in the same directory but not import them? Note that I have already tried adding an empty __init__.py file to the same directory, without effect.
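
A note for readers on newer interpreters: the imp module used in the manual load above was deprecated for a long time and removed in Python 3.12. The following is a minimal sketch of the same manual load built on importlib; load_source is a hypothetical helper name chosen for illustration, not something from the original post.

```python
import importlib.util
import sys


def load_source(name, path):
    """Load the file at `path` as a module named `name` (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load %r from %r" % (name, path))
    module = importlib.util.module_from_spec(spec)
    # Register the module first so a later `import <name>` reuses this object.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# In main.py this would be used roughly as:
#     lib = load_source('lib', 'lib.py')
```

Like imp.load_source, this bypasses the module search path entirely, so it works regardless of where the symlinks point, but it is still a manual load rather than a plain import.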


asked Aug 09 '13 15:08

ffriend

2 Answers

I posted the question to the Hadoop user list and finally found the answer. It turns out that Hadoop doesn't really copy files to the location where the command runs; instead it creates symlinks to them. Python, in turn, resolves the symlinked main.py to its real location when it sets up the module search path, so it doesn't find lib.py next to it and doesn't recognize it as an importable module.
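
The symlink behaviour described above can be reproduced locally without Hadoop. The sketch below assumes a POSIX system with symlink support, and it assumes (the post does not confirm the exact layout) that each cached file really lives in its own directory. The directory names cache_main, cache_lib and work are made up for the demonstration.

```python
import os
import subprocess
import sys
import tempfile


def run_symlinked_files_layout():
    """Run main.py through a symlink while lib.py really lives elsewhere."""
    root = os.path.realpath(tempfile.mkdtemp())
    cache_main = os.path.join(root, "cache_main")  # hypothetical cache dirs
    cache_lib = os.path.join(root, "cache_lib")
    work = os.path.join(root, "work")  # stands in for the task's working dir
    for d in (cache_main, cache_lib, work):
        os.mkdir(d)

    with open(os.path.join(cache_main, "main.py"), "w") as f:
        f.write("import sys\nprint(sys.path[0])\nimport lib\n")
    with open(os.path.join(cache_lib, "lib.py"), "w") as f:
        f.write("VALUE = 1\n")

    # The working directory holds only symlinks to the real files.
    os.symlink(os.path.join(cache_main, "main.py"), os.path.join(work, "main.py"))
    os.symlink(os.path.join(cache_lib, "lib.py"), os.path.join(work, "lib.py"))

    result = subprocess.run(
        [sys.executable, "main.py"],
        cwd=work, capture_output=True, text=True,
    )
    return result, cache_main


if __name__ == "__main__":
    result, cache_main = run_symlinked_files_layout()
    print("sys.path[0] seen by main.py:", result.stdout.strip())
    print("real directory of main.py:  ", cache_main)
    print(result.stderr.strip().splitlines()[-1])
```

The script's first sys.path entry is the real directory of main.py rather than the working directory that holds the symlinks, which is why lib.py is visible in a directory listing and still not importable.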

A simple workaround is to put both main.py and lib.py into one directory and ship that directory instead. A symlink to the directory is then placed in the MR job's working directory, while both files physically sit side by side. So I did the following:

1. Put main.py and lib.py into an app directory.
2. In main.py, imported lib directly; that is, the import statement is just

    import lib

3. Uploaded the app directory with the -files option.

So the final command looks like this:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files app \
    -mapper "app/main.py map" -reducer "app/main.py reduce" \
    -input input -output output
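
To see why shipping a directory helps, the layout can be simulated locally. This is only a sketch under assumptions: a POSIX system with symlinks, and made-up directory names (cache/app for the real location, work for the task's working directory). It does not run Hadoop itself.

```python
import os
import subprocess
import sys
import tempfile


def run_symlinked_dir_layout():
    """Run app/main.py where `app` is a symlink to a directory with both files."""
    root = os.path.realpath(tempfile.mkdtemp())
    cache_app = os.path.join(root, "cache", "app")  # hypothetical real location
    work = os.path.join(root, "work")  # stands in for the task's working dir
    os.makedirs(cache_app)
    os.mkdir(work)

    with open(os.path.join(cache_app, "main.py"), "w") as f:
        f.write("import lib\nprint(lib.VALUE)\n")
    with open(os.path.join(cache_app, "lib.py"), "w") as f:
        f.write("VALUE = 42\n")

    # Only the directory is symlinked; the two files stay physically together.
    os.symlink(cache_app, os.path.join(work, "app"))

    return subprocess.run(
        [sys.executable, os.path.join("app", "main.py")],
        cwd=work, capture_output=True, text=True,
    )


if __name__ == "__main__":
    result = run_symlinked_dir_layout()
    print("exit code:", result.returncode)
    print("output:", result.stdout.strip())
```

Because main.py and lib.py share one real directory, resolving the script's location still lands next to lib.py, and the plain import succeeds.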


answered Oct 10 '22 12:10

ffriend

When Hadoop Streaming starts your Python script, the script's path is the location where the file really is, not the symlink. However, Hadoop starts the script in './', and your lib.py (a symlink) is in './' too. So try adding sys.path.append('./') before you import lib, like this:

import sys
sys.path.append('./')
import lib
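
A slightly more defensive variant of this idea is sketched below. It adds the absolute working directory instead of './' and includes a small diagnostic for comparing the working directory with the script's real location. Both function names are hypothetical helpers, not part of Hadoop or of the original answer.

```python
import os
import sys


def add_cwd_to_path():
    """Put the current working directory first on sys.path (idempotent)."""
    cwd = os.getcwd()
    if cwd not in sys.path:
        sys.path.insert(0, cwd)
    return cwd


def describe_import_context(script=None):
    """Collect the locations that matter when an import fails in a task."""
    script = script or sys.argv[0]
    return {
        "cwd": os.getcwd(),
        "script_real_dir": os.path.dirname(os.path.realpath(script)),
        "sys_path_0": sys.path[0] if sys.path else None,
    }


# Near the top of main.py this might be used as:
#     add_cwd_to_path()
#     sys.stderr.write(repr(describe_import_context()) + "\n")
#     import lib
```

The diagnostic is written to stderr in the usage comment because, in a streaming job, stdout carries the mapper or reducer output.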

answered Oct 10 '22 12:10

Muyoo
