 

What's the best Python implementation of the MapReduce pattern?

What's the best Python implementation of MapReduce, as a framework or a library? Ideally something as good as Apache Hadoop's, but written in Python: well documented, easy to understand, a full implementation of the MapReduce pattern, highly scalable, stable, and lightweight.

I found one called mincemeat by googling, but I'm not sure about it. Are there any other well-known ones?

Thanks

asked Dec 13 '22 by leslie

2 Answers

There are some projects here and there if you search for them, for example Octopy and Disco, as well as Hadoopy.

However, I don't believe any of them can compete with Hadoop in terms of maturity, stability, scalability, performance, etc. For small cases they should suffice, but for something more "glorious", you have to stick to Hadoop.

Remember that you can still write map/reduce programs for Hadoop in Python or Jython, e.g. via Hadoop Streaming, as sketched below.
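For example, a word count with Hadoop Streaming only needs two small scripts that read from stdin and write to stdout. This is a minimal sketch; mapper.py, reducer.py, and the input/output paths are placeholders, and the path to the streaming jar depends on your installation:

# mapper.py - emit "word<TAB>1" for every word on every input line
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

# reducer.py - sum the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

You would then launch it with something like:

hadoop jar /path/to/hadoop-streaming.jar \
    -mapper mapper.py -reducer reducer.py \
    -input wc_input.txt -output wc_output \
    -file mapper.py -file reducer.py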

EDIT: I recently came across mrjob. It seems great, as it makes it easy to write map/reduce programs and then launch them on Hadoop or on Amazon's Elastic MapReduce platform. The article that brought the good news is here
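To give an idea of how little code mrjob needs, here is a minimal word-count sketch (the file name wordcount_mrjob.py is just an example):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    # Emit (word, 1) for each word in a line of input.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Sum the counts for each word.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

You can run the same script locally with python wordcount_mrjob.py input.txt, or hand it off to a cluster with -r hadoop or -r emr.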

answered Dec 28 '22 by hymloth


Update in 2019: I would highly recommend Apache Beam.
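For comparison, a word count in Beam's Python SDK is only a few lines. This is a sketch using the default local runner; wc_input.txt and wc_output are placeholder paths:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('wc_input.txt')       # one element per line
     | 'Split' >> beam.FlatMap(lambda line: line.split())    # emit individual words
     | 'Count' >> beam.combiners.Count.PerElement()          # (word, count) pairs
     | 'Format' >> beam.Map(lambda kv: '%s\t%d' % kv)
     | 'Write' >> beam.io.WriteToText('wc_output'))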

===

Another good option is Dumbo.

Below is the code to run a map/reduce for word counting.

# Emit (word, 1) for every word in the input line.
def mapper(key, value):
    for word in value.split():
        yield word, 1

# Sum up the counts emitted for each word.
def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

To run it, just feed it your text file wc_input.txt for counting; the output is saved as wc_output:

 python -m dumbo wordcount.py -hadoop /path/to/hadoop -input wc_input.txt -output wc_output
answered Dec 28 '22 by greeness