 

most efficient way to split a compressed CSV into chunks

I have a few very large, gzip-compressed CSV files (the compressed output of mysqldump), each around 65 GB.

I need to split them into compressed chunks that are each less than 4 GB (after compression), keeping in mind that quoted newline characters exist in each CSV file.

What would be the most efficient way to do this on a 'nix command line (e.g. in Debian)?

Similar to this SO question, although the responses there don't correctly account for quoted newline characters.

asked Jan 30 '23 by saladi

1 Answer

A method without using temporary disk space.

Chunk Size Estimation

First, since the compression ratio varies from row to row depending on content, it is hard to hit a precise after-compression size target without a 2-pass encoding. However, you can get a rough estimate of how large each chunk should be before compression by using

gzip -l file.csv.gz | tail -1 | awk '{print int(4*$2/$1)}'

The 4 (GB) in this command is your target size for each chunk after compression, so if the result is 21, it means you can split the uncompressed stream into chunks of roughly 21 GB each, assuming the file has a roughly uniform entropy distribution.
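One caveat: gzip -l reads the uncompressed size from the 4-byte ISIZE field in the gzip trailer, which stores the value modulo 2^32, so for inputs this large the estimate can be badly off. A slower but exact sketch, streaming the file through wc -c once (stat -c %s is the GNU coreutils form):

uncompressed=$(gzip -dc file.csv.gz | wc -c)   # exact uncompressed byte count
compressed=$(stat -c %s file.csv.gz)           # on-disk compressed size in bytes
echo $(( 4 * uncompressed / compressed ))      # chunk size in GB for a 4 GB target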

Decompress, Split and Compress

We can use the chunk size obtained above to split the file:

gzip -dc file.csv.gz | split -C 21G -d --filter='gzip > $FILE.csv.gz' - split_
  • gzip -dc decompresses the file to stdout
  • split -C 21G puts at most 21 GB of complete lines per output file
  • -d numbers the output files (split_00, split_01, …) instead of using alphabetic suffixes
  • --filter='gzip > $FILE.csv.gz' compresses each split file directly, with no intermediate uncompressed file on disk
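Once the split finishes, a quick sanity check (assuming the split_ prefix used above) is to confirm that each chunk is under the 4 GB target and that no rows were lost:

ls -lh split_*.csv.gz            # each file should be below 4 GB
gzip -dc file.csv.gz | wc -l     # original line count
gzip -dc split_*.csv.gz | wc -l  # should match the line above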

Bonus tip: replace every gzip above with pigz for faster, multi-threaded compression.
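For example, the main command becomes the following (gzip decompression is inherently serial, so the big win is on the compression side; pigz -p N optionally caps the thread count):

pigz -dc file.csv.gz | split -C 21G -d --filter='pigz > $FILE.csv.gz' - split_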

Update

To preserve the header row in each split gzipped file, we can use

gzip -dc file.csv.gz | tail -n+2 | split - --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz'

Some options of split are omitted here for simplicity, but you get the idea. The trick is to modify split's --filter option so that it prepends the header from the original CSV file to the output stream of each split file. Note that head -1 closes the pipe after the first line, so the inner gzip -dc only decompresses a small prefix of the original file rather than the whole thing. A combined sketch follows.
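Putting the pieces together, with the chunk size, numeric suffixes and prefix from above restored (21G being the estimate from the first step; the repeated header adds a negligible amount to each chunk):

gzip -dc file.csv.gz | tail -n+2 | split -C 21G -d --filter='{ gzip -dc file.csv.gz | head -1; cat; } | gzip > $FILE.csv.gz' - split_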

answered Feb 24 '23 by etopylight