
Opening files on HDFS from a Hadoop MapReduce job

Usually, I open files for reading with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This opens the two text files in the WordLists folder and stores each file's lines in the dictionary as a set, under either 'positive' or 'negative'.

However, I don't think this works when I run it as a MapReduce job within Hadoop. I am running my program like so:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I have tried to change the code to this:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

where mapreduce is a folder on HDFS and WordLists is a subfolder containing the negative words file. My program doesn't find it. Is what I'm doing possible, and if so, what is the correct way to load files from HDFS?

Edit

I've now tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get this sort of output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

Then the job fails. So it's still not right. Any ideas?

Edit 2:

Having re-read the documentation, I notice I can use the -files option on the command line to specify files. The documentation states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

-files hdfs://host:fs_port/user/testfile.txt

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.

Therefore, I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

From my understanding of the documentation, this creates symlinks so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r')

However, this still doesn't work. Any help anyone can offer would be hugely appreciated as I can't do much until I solve this.

Edit 3:

I can use this command:

-file ~/Twitter/SentimentWordLists/positive_words.txt

along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than on HDFS. It doesn't throw any errors, so the file is accepted somewhere. However, I've no idea how to access it from my script.

Asked by Andrew Martin

2 Answers

Solution after plenty of comments :)

To read a data file in Python: ship it to the tasks with -file and add the following to your script:

import sys

Sometimes it is also necessary to add, after the import:

sys.path.append('.')

(related to @DrDee's comment in Hadoop Streaming - Unable to find file error)
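
For completeness, here is a minimal sketch of what hadoop_map.py could look like under that setup. It assumes both word lists were shipped with -file (or -files), so they sit in the task's working directory under their base names; the per-line sentiment counting is invented purely for illustration and is not from the question or this answer:

#!/usr/bin/env python
import sys

sys.path.append('.')  # as suggested above; adds the task's working directory to the module search path

def load_words(name):
    # The shipped file is opened by its base name, not by an HDFS path.
    with open(name, 'r') as f:
        return {line.strip() for line in f}

positive = load_words('positive_words.txt')
negative = load_words('negative_words.txt')

# Standard Hadoop Streaming mapper loop: read lines from stdin, emit key<TAB>value.
for line in sys.stdin:
    words = line.strip().split()
    score = sum(1 for w in words if w in positive) - sum(1 for w in words if w in negative)
    print('%s\t%d' % ('positive' if score >= 0 else 'negative', score))

The job would then be launched with something along the lines of -file positive_words.txt -file negative_words.txt -file hadoop_map.py -mapper hadoop_map.py, so that all three files travel with the job.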

Answered by Alfonso Nishikawa


When dealing with HDFS programmatically, you should look into FileSystem, FileStatus, and Path. These are Hadoop API classes which allow you to access HDFS from within your program.
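
Those are Java classes, so they apply when you write a Java MapReduce job (typically something like FileSystem.get(conf).open(new Path(...))) rather than a Python streaming script. From Python, one rough workaround, sketched here only as an assumption and not something this answer prescribes, is to shell out to the hadoop fs command:

import subprocess

def hdfs_read_lines(path):
    # "hadoop fs -cat" streams the HDFS file's contents to stdout.
    out = subprocess.check_output(['hadoop', 'fs', '-cat', path])
    return out.decode('utf-8').splitlines()

negative = {line.strip() for line in hdfs_read_lines('/mapreduce/WordLists/negative_words.txt')}

This requires the hadoop binary to be on the PATH of each task, so for the streaming case in the question, shipping the word lists with -file is usually the simpler option.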

Answered by Daniel Imberman