I am trying to edit a large file on a Hadoop cluster and trim white space and special characters like ¦, *, @, " etc. from it. I don't want to copyToLocal and use sed, as I have thousands of such files to edit.
MapReduce is perfect for this. Good thing you have it in HDFS!
You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.
$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input MyLargeFiles \
-output outputdir \
-mapper "sed ..."
This will fire up a MapReduce job that applies your sed command to every line of every file. Since there are thousands of files, you'll have several mapper tasks hitting the files at once, and the data goes right back into the cluster.
Note that I set the number of reducers to 0 here, since a reduce phase isn't really needed. If you want your output to be one file, then use one reducer, but don't specify -reducer. I think that falls back to the identity reducer, which with a single reducer effectively just merges everything into one output file. The mapper-only version is definitely faster.
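For example, with the characters from your question, the mapper might look something like this (just a sketch; the exact sed expressions and escaping are yours to adjust):
$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input MyLargeFiles \
-output outputdir \
-mapper "sed -e s/[¦*@\"]//g -e s/^[[:space:]]*// -e s/[[:space:]]*\$//"
The first expression drops the special characters and the other two trim leading and trailing whitespace. Keeping the sed expressions free of spaces also avoids quoting headaches when streaming tokenizes the mapper command.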
Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream the file through a node and push it back up without hitting local disk. Here's an example:
$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt
The - in hadoop fs -put tells it to take data from stdin instead of a file.
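With thousands of files you'd have to wrap that in a loop, roughly like this (a sketch only: MyLargeFiles and cleaned are placeholder paths, and the awk step just pulls the file paths out of the -ls listing):
$ hadoop fs -mkdir cleaned
$ hadoop fs -ls MyLargeFiles | awk '/^-/ {print $NF}' | while read f; do
    hadoop fs -cat "$f" | sed '...' | hadoop fs -put - "cleaned/$(basename "$f")"
done
Keep in mind this drags every file through a single machine one at a time, which is exactly why the streaming job above is the better fit for this many files.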