I have a few very large, gzip-compressed CSV files (the compressed output of mysqldump), each around 65 GB.
I need to split them up into compressed chunks that are each less than 4 GB (after compression), keeping in mind that quoted newline characters exist in each CSV file.
What would be the most efficient way to do this on a 'nix command line (e.g. in Debian)?
Similar to this SO question, although the answers there don't correctly account for quoted newline characters.
Ideally, a method that doesn't use temporary disk space.
First, since the compression ratio varies from row to row depending on its content, it is hard to hit a precise after-compression size target unless you run a two-pass encoding. However, you can get a rough estimate of how large a chunk should be before compression with
gzip -l file.csv.gz | tail -1 | awk '{print int(4*$2/$1)}'
The 4 (GB) in this command is your target size for each chunk after compression, so if the result is 21, it means you can split the uncompressed file into chunks of roughly 21 GB each, assuming the file has a uniform entropy distribution.
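As a concrete illustration, the arithmetic with hypothetical sizes (not taken from the question's actual files) looks like the sketch below. Note that gzip -l reads size fields stored in the archive rather than decompressing, and for inputs larger than 4 GiB the stored uncompressed size wraps around, so treat the estimate as rough at best.

# hypothetical numbers: ~69 GB compressed, ~366 GB uncompressed
awk 'BEGIN { compressed=69000000000; uncompressed=366000000000; print int(4*uncompressed/compressed) }'
# prints 21, i.e. roughly 21 GB of uncompressed data should compress down to ~4 GB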
We can then use the chunk size obtained above to split the file:
gzip -dc file.csv.gz | split -C 21G -d - split_ --filter='gzip > $FILE.csv.gz'
- gzip -dc decompresses the file to stdout
- split -C 21G puts at most 21G of records (complete lines) into each output file
- split --filter='gzip > $FILE.csv.gz' compresses each split file directly as it is written

Bonus tip: replace every gzip above with pigz to enable faster multi-threaded compression.
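For reference, a sketch of the same pipeline using pigz (assuming pigz is installed, e.g. via apt install pigz on Debian), with the 21G chunk size carried over from the estimate above:

# decompress, split into ~21G uncompressed chunks, and recompress each chunk using all cores
pigz -dc file.csv.gz | split -C 21G -d - split_ --filter='pigz > $FILE.csv.gz'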
Update
To preserve the header in each split gzipped file, we can use
gzip -dc file.csv.gz | tail -n+2 | split - --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz'
Some options of split are omitted here for the sake of simplicity, but you get the idea. The trick is to modify the filter option of split so that it prepends the header from the original CSV file to the output stream of each split file.
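Putting the pieces together, a full header-preserving version might look like the sketch below; the 21G chunk size and the split_ prefix are carried over from the earlier command, so adjust them to your own estimate.

# strip the header, split into ~21G uncompressed chunks, then re-attach the header
# to every chunk before compressing it
gzip -dc file.csv.gz \
  | tail -n +2 \
  | split -C 21G -d - split_ --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz'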