What's the best Python implementation of MapReduce — a framework or a library, ideally comparable to the Apache Hadoop one, but written in Python? I'm looking for something well documented, easy to understand, a full implementation of the MapReduce pattern, highly scalable, stable, and lightweight.
I googled one called mincemeat, but I'm not sure about it. Are there any other well-known options?
Thanks
There are some pieces here and there if you search for them, for example Octopy, Disco, and Hadoopy.
However, I don't believe that any of them can compete with Hadoop in terms of maturity, stability, scalability, performance, etc. For small cases they should suffice, but for something more "glorious", you have to stick with Hadoop.
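For those small cases, the pattern itself is simple enough to run in a single process. Here is a minimal sketch of the three MapReduce phases (map, shuffle/group, reduce) in plain Python — the function names and signatures are illustrative, not from any of the libraries above:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in one input record.
    return [(word, 1) for word in line.split()]

def run_mapreduce(records, mapper, reducer):
    # Map phase: apply the mapper to each input record.
    mapped = [mapper(record) for record in records]
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce phase: fold each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}

if __name__ == "__main__":
    counts = run_mapreduce(["a b a", "b c"], mapper, lambda k, vs: sum(vs))
    print(counts)  # → {'a': 2, 'b': 2, 'c': 1}
```

The real frameworks add what this sketch omits: parallelism across machines, fault tolerance, and spilling to disk when the grouped data doesn't fit in memory.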
Remember that you can still write map/reduce programs for Hadoop in Python/Jython.
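With Hadoop Streaming, that amounts to two scripts that read lines on stdin and write tab-separated key/value pairs on stdout; Hadoop handles the sort between the two stages. A word-count sketch of both stages in one stdlib-only file (the `map`/`reduce` role selection via `sys.argv` is my own convention, not a Hadoop requirement):

```python
#!/usr/bin/env python
# Hadoop Streaming word-count sketch: Hadoop pipes input records
# through stdin and collects stdout from each stage.
import sys
from itertools import groupby

def map_lines(lines):
    # Mapper: emit "word\t1" for every word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(lines):
    # Reducer: Streaming delivers its input sorted by key,
    # so consecutive lines with the same word can be summed.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    stage = reduce_lines if "reduce" in sys.argv[1:] else map_lines
    for out in stage(sys.stdin):
        print(out)
```

You would then pass the script as both `-mapper` and `-reducer` to the `hadoop jar .../hadoop-streaming*.jar` command, with the appropriate role argument for each.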
EDIT: I recently came across mrjob. It seems great, as it eases writing map/reduce programs and then launching them on Hadoop or on Amazon's Elastic MapReduce platform. The article that brought the good news is here.
Update in 2019: Would highly recommend Apache Beam.
===
Another good option is Dumbo.
Below is the code for a word-count map/reduce job.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
To run it, just feed it your text file wc_input.txt; the output is saved as wc_output:
python -m dumbo wordcount.py -hadoop /path/to/hadoop -input wc_input.txt -output wc_output