How would I use Pydoop on Amazon EMR?
I tried googling this topic to no avail: is it at all possible?
Pydoop is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.
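For a sense of what "interacting with HDFS in pure Python" looks like, here is a minimal sketch using the pydoop.hdfs convenience functions (function names per the Pydoop docs; exact signatures vary somewhat between Pydoop versions, so treat it as illustrative):
import pydoop.hdfs as hdfs

# Write a small file to HDFS, read it back, then list the file system root
hdfs.dump("hello from pydoop\n", "pydoop_test.txt")
print hdfs.load("pydoop_test.txt")
print hdfs.ls("/")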
Amazon EMR supports both HDFS and the EMR File System (EMRFS), which uses Amazon S3; the two are compatible but not interchangeable. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop, while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency. Previously, Amazon EMR used the s3n and s3a file systems.
I finally got this working. Everything happens on the master node: SSH to that node as the user hadoop.
You need some packages:
sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev
To build stuff:
wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JVM_ARCH=64 # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/
cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr
cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
Save the two tarballs; in the future you can skip the build step and simply run the following to install (I still need to figure out how to turn this into a bootstrap action for installing on multi-node clusters; one possible sketch follows the two commands below):
sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
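As a rough starting point for such a bootstrap action, something like the script below might work. This is an untested sketch: it assumes the two tarballs have been uploaded to s3://<my bucket>/, that hadoop fs can read from S3 on a freshly started node, and it is written in Python only to match the rest of this post (a plain shell script would do just as well).
#!/usr/bin/python
# Hypothetical bootstrap action: fetch the prebuilt tarballs from S3 and
# unpack them, mirroring the two manual install commands above.
import subprocess

TARBALLS = [
    ("s3://<my bucket>/libhdfs.tar.gz", "/usr"),
    ("s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz", "/"),
]

for s3_path, dest in TARBALLS:
    local = "/tmp/" + s3_path.split("/")[-1]
    subprocess.check_call(["hadoop", "fs", "-copyToLocal", s3_path, local])
    subprocess.check_call(["sudo", "tar", "xzf", local, "-C", dest])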
I was then able to run the example program using the full-fledged Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self).__init__(context)):
#!/usr/bin/python
import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        # Emit each word in the input line with a count of 1
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

    def reduce(self, context):
        # Sum the counts emitted for each word
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))
I uploaded that program to a bucket and called it run. I then used the following conf.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.pipes.executable</name>
    <value>s3://<my bucket>/run</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>myjobname</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
</configuration>
Finally, I used the following command line:
hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf