
Editing a multi-million row file on a Hadoop cluster

I am trying to edit a large file on a Hadoop cluster and trim whitespace and special characters like ¦, *, @, and " from the file. I don't want to copyToLocal and use sed, as I have thousands of such files to edit.

asked Feb 20 '14 by Joy Jyoti

1 Answer

MapReduce is perfect for this. Good thing you have it in HDFS!

You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

This will fire up a MapReduce job that applies your sed command to every line of every input file. Since there are thousands of files, several mapper tasks will be hitting them at once, and the output goes right back into the cluster.
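For instance, assuming the goal is just to delete those specific characters and trim trailing whitespace (the clean.sh name and the exact sed expression below are my guesses at what you want, not part of the original answer), putting the command in a small script and shipping it with -file avoids shell-quoting headaches:

$ cat clean.sh
#!/bin/bash
# strip ¦ * @ " and trim trailing whitespace from every input line
sed -e 's/[¦*@"]//g' -e 's/[[:space:]]*$//'

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -file clean.sh \
   -mapper "bash clean.sh"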

Note that I set the number of reducers to 0 here, because a reduce phase isn't really needed. If you want your output to be a single file, use one reducer but don't specify -reducer; I think that falls back to the identity reducer and effectively funnels everything into one output file. The mapper-only version is definitely faster.
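If a single output file per job matters more than speed, a minimal variant of the same command would look like this (same caveat as above: this relies on the identity reducer kicking in when -reducer is omitted):

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=1 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

The trade-off is that all the mapper output has to funnel through that one reduce task, which is why the mapper-only version is faster.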


Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream the file through a single node and push it back up without ever hitting local disk. Here's an example:

$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt

The - in hadoop fs -put tells it to take data from stdin instead of a file.
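Since the question mentions thousands of files, that one-liner could be wrapped in a loop over an HDFS listing. A rough sketch, assuming the files live under /inputdir, end in .txt, have no spaces in their names, and the cleaned copies should land in /cleaned (all of those paths and the sed expression are assumptions):

$ for f in $(hadoop fs -ls /inputdir | awk '{print $NF}' | grep '\.txt$'); do
      hadoop fs -cat "$f" | sed 's/[¦*@"]//g' | hadoop fs -put - /cleaned/$(basename "$f")
  done

Keep in mind this still pulls every byte through one client machine, one file at a time, so the Hadoop Streaming job above will scale much better.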

answered Sep 20 '22 by Donald Miner