I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1
should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.
Concatenation of Gzip Files We can use some common commands like cat and tar to concatenate Gzip files in the Linux system.
GZIP, short for GNU Zip, is a compression/decompression format developed as part of a larger project to create a free software alternative to UNIX in the 1980s. This open source compression format does not support archiving, so it is used to compress single files. GZIP produces zipped files with the . gz extension.
Gzip (GNU zip) is a compressing tool, which is used to truncate the file size. By default original file will be replaced by the compressed file ending with extension (. gz). To decompress a file you can use gunzip command and your original file will be back.
If you want to compress multiple files or directory into one file, first you need to create a Tar archive and then compress the . tar file with Gzip. A file that ends in . tar.
This is a use case for process substitution. Say you have two files to sort, sorta.gz
and sortb.gz
. You can give the output of gunzip -c FILE.gz
to sort for both of these files using the <(...)
shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/...
special file.
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval
to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With