 

How to get s3distcp to merge with newlines

I have many millions of small, one-line S3 files that I'm looking to merge together. I have the s3distcp syntax down; however, I've discovered that after merging, the merged output contains no newlines between the original files' contents.

Does s3distcp include any option to force a newline in, or is there another way to accomplish this without modifying the source files directly (or copying them and doing the same)?

isueightynine asked Oct 30 '22 23:10



1 Answer

If your text files begin and end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to add the newlines. In the following example, each file contains a single JSON object (the filenames all begin with 0, so the --groupBy pattern matches every one of them), and the sed command inserts a newline between each instance of }{:

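# Create only the temporary folder. The streaming job creates its own output
# folder and fails if that folder already exists.
hadoop fs -mkdir hdfs:///tmpoutputfolder/
# Step 1: merge every file whose name matches the --groupBy pattern; the very
# large --targetSize keeps the merged data in a single file.
hadoop jar lib/emr-s3distcp-1.0.jar \
               --src s3://inputfolder \
               --dest hdfs:///tmpoutputfolder \
               --targetSize 1000000000 \
               --groupBy ".*(0).*"
# Step 2: pass the merged file through a single reducer that runs sed to
# insert a newline between adjacent JSON objects.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
               -D mapred.reduce.tasks=1 \
               --input hdfs:///tmpoutputfolder \
               --output hdfs:///finaloutputfolder \
               --mapper /bin/cat \
               --reducer '/bin/sed "s/}{/}\n{/g"'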
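
To check what the reducer's sed expression does before running it on the cluster, it can be tried locally on a small sample. This is only a sketch: the function name split_json and the sample input are made up for illustration, and it assumes GNU sed (which interprets \n in the replacement as a newline).

```shell
# Hypothetical helper (not part of s3distcp or Hadoop): applies the same
# substitution as the reducer above. Assumes GNU sed, where \n in the
# replacement text produces a newline.
split_json() {
  sed 's/}{/}\n{/g'
}

# Made-up sample: three JSON objects concatenated with no separator,
# which is what the merged file looks like after s3distcp.
printf '%s' '{"a":1}{"b":2}{"c":3}' | split_json
```

Each object should end up on its own line. Note that the pattern also matches any literal }{ inside a string value, so this only works when that sequence never appears within a record.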
maxymoo answered Nov 15 '22 06:11
