Usually, I open a file with something like this:
aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
aDict['positive'] = {line.strip() for line in f}
with open('WordLists/negative_words.txt', 'r') as f:
aDict['negative'] = {line.strip() for line in f}
This opens the two relevant text files in the WordLists folder and stores each stripped line in a set in the dictionary, keyed as either positive or negative.
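For example, once both sets are loaded, checking a word is straightforward (a quick sketch with a hypothetical word; it assumes the two files loaded without errors):
word = 'great'
if word in aDict['positive']:
    print('%s is positive' % word)
elif word in aDict['negative']:
    print('%s is negative' % word)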
When I want to run a MapReduce job within Hadoop, however, I don't think this works. I am running my program like so:
./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed
I have tried to change the code to this:
with open('/mapreduce/WordLists/negative_words.txt', 'r')
where mapreduce is a folder on HDFS, with WordLists a subfolder containing the negative words. But my program doesn't find the file. Is what I'm doing possible, and if so, what is the correct way to load files from HDFS?
Edit
I've now tried:
with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')
This seems to do something, but now I get this sort of output:
13/08/27 21:18:50 INFO streaming.StreamJob: map 0% reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob: map 50% reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob: map 0% reduce 0%
Then the job fails, so it's still not right. Any ideas?
Edit 2:
Having re-read the API, I noticed I can use the -files option in the terminal to specify files. The API states:
The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.
-files hdfs://host:fs_port/user/testfile.txt
In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.
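To illustrate the docs' example, the task script could then read testfile.txt by its plain name from its working directory (a minimal sketch, assuming the job was launched with the -files option above):
# testfile.txt is a symlink in the task's working directory,
# pointing at the local copy fetched from HDFS
with open('testfile.txt', 'r') as f:
    words = {line.strip() for line in f}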
Therefore, I run:
./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed
From my understanding of the API, this creates symlinks so I can use "positive_words" and "negative_words" in my code, like this:
with open('negative_words.txt', 'r')
However, this still doesn't work. Any help anyone can offer would be hugely appreciated as I can't do much until I solve this.
Edit 3:
I can use this command:
-file ~/Twitter/SentimentWordLists/positive_words.txt
along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than HDFS. This doesn't throw any errors, so it's accepted somewhere as a file. However, I've no idea how to access the file.
Solution after plenty of comments :)
Reading a data file in Python: send it with -file and add the following to your script:
import sys
Sometimes it is also necessary to add, after the import:
sys.path.append('.')
(related to @DrDee's comment in Hadoop Streaming - Unable to find file error)
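For concreteness, a minimal streaming mapper along these lines might look like the sketch below. It assumes positive_words.txt was shipped with -file, so it is available under its base name in the task's working directory; the word counting itself is just an illustration:
#!/usr/bin/env python
import sys

sys.path.append('.')  # make sure the task's working directory is searched

# positive_words.txt was shipped with -file, so open it by its base name
with open('positive_words.txt', 'r') as f:
    positive = {line.strip() for line in f}

for line in sys.stdin:
    for word in line.strip().split():
        if word in positive:
            # emit key<TAB>value pairs for Hadoop Streaming
            sys.stdout.write('%s\t1\n' % word)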
When dealing with HDFS programmatically, you should look into FileSystem, FileStatus, and Path. These are Hadoop API classes which allow you to access HDFS from within your program.
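If you would rather stay in Python than use the Java API, one workaround is to shell out to the hadoop command-line client. This is not part of the Hadoop API itself, just a sketch; the path below is only an example:
import subprocess

# Read an HDFS file by shelling out to the hadoop CLI
data = subprocess.check_output(
    ['hadoop', 'fs', '-cat', '/mapreduce/WordLists/negative_words.txt'])
negative = {line.strip() for line in data.splitlines()}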