
Editing a multi-million row file on a Hadoop cluster

I am trying to edit a large file on a Hadoop cluster and trim whitespace and special characters like ¦, *, @, and " from the file. I don't want to copyToLocal and use sed, as I have thousands of such files to edit.

asked Feb 20 '14 by Joy Jyoti

1 Answer

MapReduce is perfect for this. Good thing you have it in HDFS!

You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

This will fire up a MapReduce job that applies your sed command to every line of every input file. Since there are thousands of files, several mapper tasks will be hitting them at once, and the output goes right back into the cluster.
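For instance, assuming the goal is just to delete those specific characters and trim trailing whitespace (the clean.sh name and the exact sed expression below are my guesses at what you want, not part of the original answer), putting the command in a small script and shipping it with -file avoids shell-quoting headaches:

$ cat clean.sh
#!/bin/bash
# strip ¦ * @ " and trim trailing whitespace from every input line
sed -e 's/[¦*@"]//g' -e 's/[[:space:]]*$//'

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=0 \
   -input MyLargeFiles \
   -output outputdir \
   -file clean.sh \
   -mapper "bash clean.sh"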

Note that I set the number of reducers to 0 here, because a reduce phase isn't really needed. If you want your output to be a single file, use one reducer but don't specify -reducer; I think that falls back to the identity reducer and effectively funnels everything into one output file. The mapper-only version is definitely faster.
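If a single output file per job matters more than speed, a minimal variant of the same command would look like this (same caveat as above: this relies on the identity reducer kicking in when -reducer is omitted):

$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
   -D mapred.reduce.tasks=1 \
   -input MyLargeFiles \
   -output outputdir \
   -mapper "sed ..."

The trade-off is that all the mapper output has to funnel through that one reduce task, which is why the mapper-only version is faster.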


Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream the file through a single node and push it back up without ever hitting local disk. Here's an example:

$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt

The - in hadoop fs -put tells it to take data from stdin instead of a file.
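Since the question mentions thousands of files, that one-liner could be wrapped in a loop over an HDFS listing. A rough sketch, assuming the files live under /inputdir, end in .txt, have no spaces in their names, and the cleaned copies should land in /cleaned (all of those paths and the sed expression are assumptions):

$ for f in $(hadoop fs -ls /inputdir | awk '{print $NF}' | grep '\.txt$'); do
      hadoop fs -cat "$f" | sed 's/[¦*@"]//g' | hadoop fs -put - /cleaned/$(basename "$f")
  done

Keep in mind this still pulls every byte through one client machine, one file at a time, so the Hadoop Streaming job above will scale much better.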

answered Sep 20 '22 by Donald Miner