I'm trying to run an mrjob example from an IPython notebook:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
Then I run it with this code:
mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
and I get this error:
TypeError: <module '__main__' (built-in)> is a built-in class
Is there a way to run mrjob from an IPython notebook?
I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell's contents to a file:
%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
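If you don't already have an input file in the notebook's working directory, the same magic can write one; a trivial sketch (example.txt is just the name the later cell passes to the job):

%%file example.txt
some sample text for the word count
another line of sample text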
And then have mrjob
run that file in a later cell:
import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
Notice that I called my file wordcount.py
and that I import the class MRWordFrequencyCount
from the wordcount
module -- the filename and the module name have to match. Also note that Python caches imported modules: when you change wordcount.py,
IPython will not reload the module but will keep using the old, cached one. That's why I put the reload()
call in there.
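(Side note: on Python 3, reload() is no longer a builtin; a quick sketch of the equivalent:)

# Python 3: reload() lives in the importlib module instead of being a builtin
from importlib import reload

import wordcount
reload(wordcount)  # pick up changes after re-running the %%file cell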
Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
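One more note: newer mrjob releases (0.6 and later, if I remember correctly) deprecated runner.stream_output() in favour of runner.cat_output() combined with the job's parse_output(); a sketch of the second cell under that assumption:

import wordcount

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # cat_output() yields raw output chunks; parse_output() decodes them into key/value pairs
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)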
Update (shorter)
For a shorter second notebook cell, you can run the job by invoking the shell from within the notebook (the script name has to match the file written with %%file above):
! python wordcount.py shakespeare.txt
Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
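If you want the output back in Python rather than only printed, IPython can also capture a shell command's stdout into a variable; a small sketch reusing the same file names:

# The assignment form of ! captures stdout as a list-like object of output lines
output = ! python wordcount.py shakespeare.txt
for line in output:
    print(line)  # each line is a tab-separated key/value pair from the job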
I suspect it is due to this limitation stated in the mrjob documentation:
The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.
Alternatively, it might be because you didn't have the following (reference):
if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
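For completeness, a minimal sketch of the full job file with that guard in place (using the class name from the question):

# wordcount.py -- complete job file; the guard lets mrjob re-invoke this file as the job script
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()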