Does anyone know of a tool that can "crunch" the output files of Apache Hadoop into fewer files, or even a single file? Currently I download all the files to a local machine and then concatenate them into one file. Does anyone know of an API or a tool that does the same? Thanks in advance.
Limiting the number of output files means limiting the number of reducers. You can do that with the mapred.reduce.tasks property from the Hive shell. Example:
hive> set mapred.reduce.tasks=5;
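For example, a minimal sketch (table and column names here are hypothetical): forcing a single reducer produces a single output file, at the cost of making that reducer a bottleneck. Note that a plain SELECT * insert is map-only, so a DISTRIBUTE BY is needed to force a reduce phase at all:

hive> set mapred.reduce.tasks=1;
hive> INSERT OVERWRITE TABLE target_table
    > SELECT * FROM source_table
    > DISTRIBUTE BY id;   -- forces a shuffle so the reducer count takes effect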
But it might affect the performance of your query. Alternatively, you could use the getmerge command from the HDFS shell once your query is done. It takes a source directory and a destination file as input and concatenates the files under src into a single local destination file.
Usage :
bin/hadoop fs -getmerge <src> <localdst>
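For example (the paths here are hypothetical), to merge all the part files under a Hive table's warehouse directory into one local file:

bin/hadoop fs -getmerge /user/hive/warehouse/my_table /tmp/my_table_merged.txt

Keep in mind that getmerge writes to the local filesystem; if you need the merged result back in HDFS, you would have to upload it again with hadoop fs -put.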
HTH
See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038
set hive.merge.mapfiles=true; -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true; -- Merge small files at the end of a map-reduce job.
set hive.merge.size.per.task=???; -- Size (bytes) of merged files at the end of the job.
set hive.merge.smallfiles.avgsize=???; -- File size (bytes) threshold
-- When the average output file size of a job is less than this number,
-- Hive will start an additional map-reduce job to merge the output files
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
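Putting it together, a minimal sketch (the sizes are illustrative choices close to Hive's documented defaults, not values you must use):

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;      -- aim for ~256 MB per merged file
set hive.merge.smallfiles.avgsize=16000000;  -- merge only if average output file < ~16 MB

With these settings, Hive launches the extra merge stage after a query only when the output files are small enough to warrant it, so the overhead is paid just when it helps.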