How to use Hadoop Streaming with LZO-compressed Sequence Files?

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.

For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."

What do I need to do in order to process these input files with Hadoop Streaming?

I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?

I've tried using a very simple identity script as both my mapper and reducer:

#!/usr/bin/env ruby

# Identity: pass each input line through unchanged.
STDIN.each do |line|
  puts line
end

but this doesn't work.
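Roughly, the full streaming command I've been running looks like this (the output location is just a placeholder, and identity.rb is the script above):

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output s3n://my-output-bucket/ngrams-test/ \
  -mapper identity.rb \
  -reducer identity.rb \
  -file identity.rb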

asked Feb 20 '11 by grautur

1 Answer

LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works:

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
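
If you want to plug in your Ruby identity script instead of the built-in IdentityMapper, something along these lines should work (assuming the script is saved as identity.rb and marked executable; the output directory is a placeholder):

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output_ruby \
  -mapper identity.rb \
  -file identity.rb

The -file flag ships identity.rb out to the task nodes so the streaming job can execute it as the mapper.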
answered Oct 05 '22 by mat kelcey