
Merge sort gzipped files

Tags: linux, bash, unix

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.

How can I merge all of these files so that the resulting output is also sorted?

I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.

PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.

mossaab asked Jul 03 '14 20:07





1 Answer

This is a use case for process substitution. Say you have two files to merge, sorta.gz and sortb.gz. You can feed the output of gunzip -c FILE.gz to sort for each of these files using the <(...) shell operator:

sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted

Process substitution replaces each <(...) expression with a file name from which the output of the enclosed command can be read. It is typically implemented with either a named pipe or a /dev/fd/... special file, so nothing is decompressed to disk.
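To make the substitution visible, here is a minimal bash sketch. The function names show_subst_path and first_line are hypothetical and chosen only for illustration. The exact path printed depends on the system (commonly something under /dev/fd on Linux).

```bash
#!/usr/bin/env bash
# Requires bash (or another shell with process substitution); plain sh will not parse <(...).

# Print the file name that the shell substitutes for <(...).
show_subst_path() {
    # echo receives the substituted path as an ordinary argument.
    echo <(true)
}

# Pass a process substitution to a command that expects a file name.
first_line() {
    head -n 1 <(printf 'first\nsecond\n')
}
```

Calling show_subst_path prints a path rather than any command output, which is why sort can treat each <(gunzip -c ...) as just another input file.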

For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:

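# Build one process substitution per input file, then run the command with eval.
cmd="sort -m -k1"
for input in file1.gz file2.gz file3.gz ...; do
    # printf %q quotes the file name so it survives the second parse done by eval
    cmd="$cmd <(gunzip -c $(printf '%q' "$input"))"
done
eval "$cmd" >sorted       # or eval "$cmd" | gzip -c > sorted.gz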
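If you would rather avoid eval, the same idea can be sketched with explicit named pipes, which is one of the mechanisms process substitution itself may use. This is an alternative to the loop above, not the method from the answer. The helper name merge_gz and the in.N pipe names are hypothetical.

```bash
#!/usr/bin/env bash
# Merge already-sorted .gz files through named pipes, without eval.
# merge_gz is a hypothetical helper; pass the .gz files as arguments.
merge_gz() {
    local tmpdir fifo input status i=0
    local -a fifos=()
    tmpdir=$(mktemp -d) || return 1

    for input in "$@"; do
        fifo="$tmpdir/in.$i"
        mkfifo "$fifo" || return 1
        # Each decompressor blocks until sort opens its pipe for reading.
        gunzip -c -- "$input" > "$fifo" &
        fifos+=("$fifo")
        i=$((i + 1))
    done

    # Same merge as in the answer, now reading from the named pipes.
    sort -m -k1 -- "${fifos[@]}"
    status=$?

    wait                  # reap the background gunzip processes
    rm -rf -- "$tmpdir"   # remove the pipes and their directory
    return "$status"
}
```

Usage would look like merge_gz file1.gz file2.gz | gzip -c > sorted.gz. One caveat that applies to either approach, assuming GNU sort: it merges at most 16 inputs at once by default and writes intermediate results to temporary files, so with 40 inputs its --batch-size option (for example --batch-size=40) is worth considering when disk space is tight.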
user4815162342 answered Sep 28 '22 06:09
