I'm trying to run an mrjob example from an IPython notebook:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
Then I run it with this code:
mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
and I get this error:
TypeError: <module '__main__' (built-in)> is a built-in class
Is there a way to run mrjob from an IPython notebook?
I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell's contents to a file:
%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
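If you don't already have an input file in the notebook's working directory, the same magic can write one; a trivial sketch (example.txt is just the name the later cell passes to the job):

%%file example.txt
some sample text for the word count
another line of sample text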
And then have mrjob
run that file in a later cell:
import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value
Notice that I called my file wordcount.py
and that I import the class MRWordFrequencyCount
from the wordcount
module -- the filename and the module name have to match. Also note that Python caches imported modules: when you change wordcount.py,
IPython will not reload the module but will keep using the old, cached one. That's why I put the reload()
call in there.
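(Side note: on Python 3, reload() is no longer a builtin; a quick sketch of the equivalent:)

# Python 3: reload() lives in the importlib module instead of being a builtin
from importlib import reload

import wordcount
reload(wordcount)  # pick up changes after re-running the %%file cell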
Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
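One more note: newer mrjob releases (0.6 and later, if I remember correctly) deprecated runner.stream_output() in favour of runner.cat_output() combined with the job's parse_output(); a sketch of the second cell under that assumption:

import wordcount

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # cat_output() yields raw output chunks; parse_output() decodes them into key/value pairs
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)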
Update (shorter)
For a shorter second notebook cell, you can run the job by invoking the shell from within the notebook (the script name has to match the file written with %%file above):
! python wordcount.py shakespeare.txt
Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
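If you want the output back in Python rather than only printed, IPython can also capture a shell command's stdout into a variable; a small sketch reusing the same file names:

# The assignment form of ! captures stdout as a list-like object of output lines
output = ! python wordcount.py shakespeare.txt
for line in output:
    print(line)  # each line is a tab-separated key/value pair from the job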
I suspect it is due to this limitation stated in the mrjob documentation:
The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.
Alternatively, it might be because you didn't have the following (reference):
if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
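For completeness, a minimal sketch of the full job file with that guard in place (using the class name from the question):

# wordcount.py -- complete job file; the guard lets mrjob re-invoke this file as the job script
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()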