How would I use Pydoop on Amazon EMR?
I tried googling this topic to no avail: is it at all possible?
Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.
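For a sense of what "interacting with HDFS in pure Python" looks like, here is a minimal sketch using the pydoop.hdfs convenience functions (function names per the Pydoop docs; exact signatures vary somewhat between Pydoop versions, so treat it as illustrative):
import pydoop.hdfs as hdfs

# Write a small file to HDFS, read it back, then list the file system root
hdfs.dump("hello from pydoop\n", "pydoop_test.txt")
print hdfs.load("pydoop_test.txt")
print hdfs.ls("/")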
Amazon EMR supports both HDFS and the EMR File System (EMRFS), which uses Amazon S3; the two are compatible but not interchangeable. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop, while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency. Previously, Amazon EMR used the s3n and s3a file systems.
I finally got this working. Everything happens on the master node: SSH to that node as the user hadoop.
You need some packages:
sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev
To build stuff:
wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JVM_ARCH=64 # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/
cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr
cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
Save the two tarballs; in the future you can skip the build step and simply run the following to install (I still need to figure out how to turn this into a bootstrap action for installing on multi-node clusters; one possible sketch follows the two commands below):
sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
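As a rough starting point for such a bootstrap action, something like the script below might work. This is an untested sketch: it assumes the two tarballs have been uploaded to s3://<my bucket>/, that hadoop fs can read from S3 on a freshly started node, and it is written in Python only to match the rest of this post (a plain shell script would do just as well).
#!/usr/bin/python
# Hypothetical bootstrap action: fetch the prebuilt tarballs from S3 and
# unpack them, mirroring the two manual install commands above.
import subprocess

TARBALLS = [
    ("s3://<my bucket>/libhdfs.tar.gz", "/usr"),
    ("s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz", "/"),
]

for s3_path, dest in TARBALLS:
    local = "/tmp/" + s3_path.split("/")[-1]
    subprocess.check_call(["hadoop", "fs", "-copyToLocal", s3_path, local])
    subprocess.check_call(["sudo", "tar", "xzf", local, "-C", dest])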
I was then able to run the example program using the full-fledged Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self).__init__(context)):
#!/usr/bin/python
import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        # Emit each word in the input line with a count of 1
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

    def reduce(self, context):
        # Sum the counts emitted for each word
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))
I uploaded that program to a bucket and called it run. I then used the following conf.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.pipes.executable</name>
    <value>s3://<my bucket>/run</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>myjobname</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
</configuration>
Finally, I used the following command line:
hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf