I have a program (gawk) that outputs a stream of data to its STDOUT. The data processed is literally tens of GBs. I don't want to persist it in a single file, but rather split it into chunks and potentially apply some extra processing (like compression) to each before saving.
My data is a sequence of records, and I don't want the splitting to cut a record in half. Each record matches the following regexp:
^\{index.+?\}\}\n\{.+?\}$
or, for simplicity, you can assume that two rows (first odd, then even, when numbering from the beginning of the stream) always make a record.
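For illustration only, a hypothetical two-line record of that shape could look like (made-up content, just matching the regexp above):
{index: {some: metadata}}
{the: actual, record: body}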
Can this be done with standard tools? I've become aware of commands like GNU parallel and csplit, but I don't know how to put them together.
It would be nice if the functionality explained above could be achieved without writing a custom Perl script for it. That could be another, last-resort solution, but again, I'm not sure how best to implement it.
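For reference, such a last-resort splitter would not need Perl; a short gawk filter could do it. The following is only a rough sketch under the two-lines-per-record assumption, with a made-up chunk size and output name pattern (yourscript.awk and input.dat are placeholders); it pipes each chunk into its own gzip process:
gawk -f yourscript.awk input.dat | gawk '
  BEGIN { recs = 100000; chunk = 0               # assumed records per chunk; chunk counter
          cmd = sprintf("gzip > chunk%05d.gz", chunk) }
  { print | cmd                                  # append the current line to the open chunk
    if (NR % 2 == 0 && (NR / 2) % recs == 0) {   # a record ends on every even line
      close(cmd)                                 # finish the current gzip process
      chunk++
      cmd = sprintf("gzip > chunk%05d.gz", chunk)
    }
  }
  END { close(cmd) }'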
GNU Parallel can split stdin into chunks of records. This will split stdin into 50 MB chunks, with each record being 2 lines. Each chunk will be passed to gzip and compressed into a file named [chunk number].gz:
cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz
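Since the data in the question comes from a running gawk program rather than from an existing file, the same pipeline can be fed straight from the producer (yourscript.awk and input.dat are placeholders):
gawk -f yourscript.awk input.dat | parallel -l2 --pipe --block 50m gzip ">"{#}.gz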
If you know your second line will never start with '{index' you can use '{index' as the record start:
cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz
You can then easily test whether the splitting went correctly with:
parallel zcat {} \| wc -l ::: *.gz
Unless your records are all the same length, you will probably see a different number of lines in each chunk, but all of them even.
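Another quick sanity check, assuming every record really does start with '{index', is to verify that each chunk begins on a record boundary:
for f in *.gz; do zcat "$f" | head -n1 | grep -q '^{index' || echo "bad boundary: $f"; done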
Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
Alternatively, you can use the split utility (which is shipped with the GNU coreutils package, in contrast to parallel, so it has a better chance of being found on the target system). It can read STDIN (in addition to ordinary files), use by-line or by-size thresholds, and apply custom logic to chunks via the --filter CMD option. Please refer to the corresponding man page for usage details.
cat target | split -d -l10000 --suffix-length 5 --filter 'gzip > $FILE.gz' - prefix.
This will split STDIN into gzipped chunks of 10000 lines each, named prefix.<CHUNK_NUMBER>.gz, where <CHUNK_NUMBER> starts from 0 and is padded with zeros to a length of 5 (e.g. 00000, 00001, 00002, etc.). The start number and an extra suffix can be set too.
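For example (a sketch assuming GNU coreutils split; the .jsonl additional suffix is an arbitrary choice), the following starts numbering at 1 and inserts an extra suffix before the filter runs, so the chunks should come out as prefix.00001.jsonl.gz, prefix.00002.jsonl.gz, and so on:
cat target | split -l10000 --numeric-suffixes=1 --suffix-length 5 --additional-suffix=.jsonl --filter 'gzip > $FILE.gz' - prefix.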