
How to merge multiple Parquet files into a single Parquet file using a Linux or HDFS command?

Tags:

hdfs

parquet

I have multiple small Parquet files generated as the output of a HiveQL job, and I would like to merge them into a single Parquet file.

What is the best way to do this using HDFS or Linux commands?

We used to merge text files with the cat command, but will this work for Parquet as well? Can we do it in HiveQL itself when writing the output files, the way we would with the repartition or coalesce method in Spark?

Shankar asked Jul 27 '16 10:07


1 Answer

According to https://issues.apache.org/jira/browse/PARQUET-460, parquet-tools now has a built-in merge command. You can download the source code, compile parquet-tools, and run:

java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_dir/file_name

Alternatively, use a tool such as https://github.com/stripe/herringbone.
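If you run the parquet-tools merge regularly, it may help to wrap the command in a small shell function. The following is only a minimal sketch: the function name merge_parquet and the DRY_RUN variable are made up for illustration, and the jar path and directories are placeholders you would replace with your own. With DRY_RUN=1 it prints the command instead of running it, so you can check the paths first.

```shell
#!/bin/sh
# Sketch: wrap the parquet-tools "merge" invocation shown in the answer.
# merge_parquet and DRY_RUN are hypothetical names, not part of parquet-tools.

merge_parquet() {
    jar=$1          # path to the compiled parquet-tools jar
    input_dir=$2    # directory holding the small Parquet files
    output_file=$3  # path of the single merged Parquet file

    # All three arguments are required.
    if [ -z "$jar" ] || [ -z "$input_dir" ] || [ -z "$output_file" ]; then
        echo "usage: merge_parquet <parquet-tools.jar> <input_dir> <output_file>" >&2
        return 2
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be executed.
        printf '%s\n' "java -jar $jar merge $input_dir $output_file"
    else
        java -jar "$jar" merge "$input_dir" "$output_file"
    fi
}

# Example (placeholder paths):
#   DRY_RUN=1 merge_parquet ./target/parquet-tools-1.8.2-SNAPSHOT.jar /input_directory/ /output_directory/file_name
```

Running it once with DRY_RUN=1 and then again without it keeps the actual merge a deliberate second step.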

giaosudau answered Sep 20 '22 06:09