 

Pydoop on Amazon EMR

How would I use Pydoop on Amazon EMR?

I tried googling this topic to no avail: is it at all possible?

asked May 24 '12 by jldupont


People also ask

What is Pydoop?

Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.
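
For illustration, here is a minimal sketch of that pure-Python HDFS access (function names from Pydoop's high-level hdfs module; the paths are placeholders, and the 0.6 release used in the answer below may expose a slightly different API):

import pydoop.hdfs as hdfs

# write a small file, read it back and list the directory (placeholder paths)
hdfs.dump("hello from pydoop\n", "/user/hadoop/hello.txt")
print(hdfs.load("/user/hadoop/hello.txt"))
print(hdfs.ls("/user/hadoop"))

# file-like access to the same file
f = hdfs.open("/user/hadoop/hello.txt")
print(f.read())
f.close()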

What can you do using AWS EMR on S3?

EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency. Previously, Amazon EMR used the s3n and s3a file systems.

Does EMR work with S3?

HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable.


1 Answer

I finally got this working. Everything happens on the master node: SSH to that node as the user hadoop.

You need some packages:

sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev

To build stuff:

wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz

export JAVA_HOME=/usr/lib/jvm/java-6-sun 
export JVM_ARCH=64 # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/

cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr

cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /

Save the two tarballs; in the future you can skip the build part and simply do the following to install (I still need to figure out how to do this as a bootstrap action for installing on multi-node clusters; a rough sketch follows the commands below):

sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
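
Untested, but a bootstrap action for the multi-node case would presumably just be a small script that runs on every node, pulls the two tarballs from S3 and unpacks them. A rough sketch (the bucket name is a placeholder, and it assumes hadoop fs can already read s3:// paths at bootstrap time; if not, fetch the files with curl or s3cmd instead):

#!/usr/bin/python
# hypothetical bootstrap action: fetch the pre-built tarballs and unpack them
import subprocess

def run(cmd):
  # check_call raises if the command fails, so EMR marks the bootstrap action as failed
  subprocess.check_call(cmd, shell=True)

run("hadoop fs -get s3://<my bucket>/libhdfs.tar.gz /home/hadoop/")
run("hadoop fs -get s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz /home/hadoop/")
run("sudo tar xzf /home/hadoop/libhdfs.tar.gz -C /usr")
run("sudo tar xzf /home/hadoop/pydoop-0.6.0.linux-x86_64.tar.gz -C /")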

I was then able to run the example program that uses the full-fledged Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self).__init__(context)).

#!/usr/bin/python

import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

  def __init__(self, context):
    super(WordCountMapper, self).__init__(context)
    context.setStatus("initializing")
    # a custom counter that shows up in the Hadoop counters for the job
    self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

  def map(self, context):
    # emit each word with a count of 1 (pipes keys and values are strings)
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")
    context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

  def reduce(self, context):
    # sum the counts emitted for each word
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))

I uploaded that program to a bucket and called it run. I then used the following conf.xml:

<?xml version="1.0"?>
<configuration>

<property>
  <name>hadoop.pipes.executable</name>
  <value>s3://<my bucket>/run</value>
</property>

<property>
  <name>mapred.job.name</name>
  <value>myjobname</value>
</property>

<property>
  <name>hadoop.pipes.java.recordreader</name>
  <value>true</value>
</property>

<property>
  <name>hadoop.pipes.java.recordwriter</name>
  <value>true</value>
</property>

</configuration>

Finally, I used the following command line:

hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf

answered by Nathan Binkert