I want to read a list of IDs from a file in my Hadoop streaming job. Here is my simple mapper.py:
#!/usr/bin/env python
import sys
import json

def read_file():
    id_list = []
    # read ids from a file
    f = open('../user_ids', 'r')
    for line in f:
        line = line.strip()
        id_list.append(line)
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)
and here is my reducer.py:
#!/usr/bin/env python
from operator import itemgetter
import sys

current_id = None
current_list = []
id = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    id, line = line.split('\t', 1)
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: the user id) before it is passed to the reducer
    if current_id == id:
        current_list.append(line)
    else:
        if current_id:
            # write result to STDOUT
            print '%s\t%s' % (current_id, current_list)
        current_id = id
        current_list = [line]

# do not forget to output the last group if needed!
if current_id == id:
    print '%s\t%s' % (current_id, current_list)
Now, to run it I say:
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'
The job starts to run:
13/11/07 05:04:52 INFO streaming.StreamJob: map 0% reduce 0%
13/11/07 05:05:21 INFO streaming.StreamJob: map 100% reduce 100%
13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:
I get the error:
job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201309172143_1390_m_000001
13/11/07 05:05:21 INFO streaming.StreamJob: killJob...
When I do not read the IDs from the file ../user_ids, the job runs without any errors, so I think the problem is that it cannot find my ../user_ids file. I have also tried using the HDFS location of the file, and that did not work either. Thanks for your help.
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'
Does ../user_ids exist on your local file path when you execute the job? If it does, then you need to amend your mapper code to account for the fact that this file will be available in the local working directory of the mapper at runtime:
f = open('user_ids','r')
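For example, read_file in mapper.py could become the following (a minimal sketch; the fallback to the original relative path is just an assumption so the script still runs when you test it locally, outside of Hadoop):

import os

def read_file():
    # files shipped with -file are placed in the task's current
    # working directory, so open the file by its bare name first;
    # fall back to the original relative path for local testing
    path = 'user_ids' if os.path.exists('user_ids') else '../user_ids'
    id_list = []
    f = open(path, 'r')
    for line in f:
        id_list.append(line.strip())
    f.close()
    return id_list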
Try giving the full path of the file, or, when executing the hadoop command, make sure you are in the same directory in which the user_ids file is present.
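For example (/home/user/user_ids below is just a placeholder for wherever the file actually lives on the machine submitting the job):

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file /home/user/user_ids

Whatever path you pass to -file, the file is shipped to every task and shows up there under its base name (user_ids), so the mapper should still open it as open('user_ids', 'r').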