Run MRJob from IPython notebook

I'm trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

I then run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and I get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

szu asked Oct 01 '22 08:10


2 Answers

I haven't found the "perfect way" yet, but one workaround is to put the job in its own notebook cell and use the %%file cell magic, which writes the cell's contents to a file:

%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

And then have mrjob run that file in a later cell:

import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

Notice that I called my file wordcount.py and that I import the class MRWordFrequencyCount from the wordcount module: the filename and the module name have to match. Also, Python caches imported modules, so when you change wordcount.py, IPython will not reload the module but will use the old, cached one. That's why I put the reload() call in there.

Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
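
The caching behaviour described above can be reproduced without mrjob. The following is only a sketch in Python 3, where reload() lives in importlib rather than being a builtin as in the Python 2 code above. The module name demo_job_module and the VERSION variable are made up for illustration.

```python
import importlib
import os
import sys
import tempfile


def demo_reload():
    """Return the value seen on first import, on re-import, and after reload()."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep stale .pyc files out of this quick demo
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_job_module.py")  # hypothetical module name
        sys.path.insert(0, tmp)
        try:
            with open(path, "w") as f:
                f.write("VERSION = 1\n")
            importlib.invalidate_caches()
            module = importlib.import_module("demo_job_module")
            first = module.VERSION

            # Edit the file on disk, as re-running a %%file cell would.
            with open(path, "w") as f:
                f.write("VERSION = 2\n")

            # A plain re-import is served from sys.modules; the file is not re-read.
            cached = importlib.import_module("demo_job_module").VERSION

            # reload() re-executes the module source from disk.
            reloaded = importlib.reload(module).VERSION
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_job_module", None)
            sys.dont_write_bytecode = old_flag
    return first, cached, reloaded


if __name__ == "__main__":
    print(demo_reload())  # (1, 1, 2)
```

The middle value stays at 1 even though the file already says 2, which is the situation a notebook is in after the %%file cell has been edited and re-run.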

Update (shorter)
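For a shorter second notebook cell, you can run the job by invoking the shell from within the notebook. For this to do anything, the job file has to end with an if __name__ == '__main__': block that calls the job class's run() method. Don't name the job file mrjob.py, because it would shadow the mrjob package when the script imports it:

! python wordcount.py shakespeare.txt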

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb

qff answered Oct 02 '22 23:10


I suspect it is due to this limitation stated on the MRJob website:

The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.

Alternatively, it might be because you didn't have the following (reference):

if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
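
What the guard does can be seen without Hadoop or mrjob. The sketch below writes a hypothetical stand-in script (standin_job.py, a plain counter and not an MRJob subclass) to a temporary directory and runs it the way `python script.py input.txt` would from a shell. Every name in it is an assumption made for illustration.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for a job file: no mrjob involved, only the guard pattern.
SCRIPT = textwrap.dedent('''
    import sys

    def count(path):
        """Return (chars, words, lines) for a text file."""
        chars = words = lines = 0
        with open(path) as f:
            for line in f:
                chars += len(line)
                words += len(line.split())
                lines += 1
        return chars, words, lines

    if __name__ == '__main__':
        # Runs only when the file is executed as a script, not when imported.
        print(*count(sys.argv[1]))
''')


def run_as_script(text):
    """Write SCRIPT and an input file to a temp dir, then run the script on it."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "standin_job.py")
        data = os.path.join(tmp, "input.txt")
        with open(script, "w") as f:
            f.write(SCRIPT)
        with open(data, "w") as f:
            f.write(text)
        result = subprocess.run(
            [sys.executable, script, data],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_as_script("to be or not\nto be\n"))  # 19 6 2
```

Without the guarded block, the subprocess would define count() and exit without printing anything. A job file that lacks the guard behaves the same way when launched from the shell.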

sapo_cosmico answered Oct 02 '22 23:10