 

split STDIN to multiple files (and compress them if possible)

I have a program (gawk) that outputs a stream of data to its STDOUT. The data processed amounts to tens of GBs. I don't want to persist it in a single file but rather split it into chunks, and potentially apply some extra processing (like compression) to each chunk before saving.

My data is a sequence of records and I don't want the splitting to cut a record in half. Each record matches the following regexp:

^\{index.+?\}\}\n\{.+?\}$

or, for simplicity, you can assume that two rows (first odd, then even, when numbering from the beginning of the stream) always make a record.
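
For illustration, a matching record could look like this (hypothetical data; the actual payload will differ):

{index: {_index: "test", _id: 1}}
{field1: "value1", field2: "value2"}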

Can I:

  • use some standard Linux command to split STDIN by defining a preferred chunk size? It doesn't need to be exact, since the variable record size can't guarantee that. Alternatively, just a number of records, if defining by size is impossible
  • compress each chunk and store it in a file (with some numbering in its name, like 001, 002, etc.)?

I've become aware of commands like GNU parallel or csplit, but I don't know how to put them together. It would be nice if the functionality explained above could be achieved without writing a custom Perl script for it. That could, however, be another, last-resort solution, but again, I'm not sure how best to implement it.

asked Mar 25 '14 by msciwoj




2 Answers

GNU Parallel can split stdin into chunks of records. The following will split stdin into 50 MB chunks, with each record being 2 lines. Each chunk is passed to gzip and compressed into a file named [chunk number].gz:

cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz
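
Note that the ">" is quoted so the redirection is interpreted by the shell GNU parallel spawns for each chunk rather than by your current shell, and {#} is GNU parallel's job sequence number, so the chunks come out as 1.gz, 2.gz, and so on.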

If you know your second line will never start with '{index', you can use '{index' as the record start:

cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz

You can then easily test whether the splitting went correctly:

parallel zcat {} \| wc -l ::: *.gz

Unless your records are all the same length, you will probably see a different number of lines per file, but they will all be even.
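
If you later need to reconstruct the original stream, a minimal sketch (assuming the 1.gz, 2.gz, ... names produced by {#} above):

ls *.gz | sort -n | xargs zcat > restored

Here sort -n orders the files by their leading chunk number, so 10.gz comes after 9.gz rather than after 1.gz.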

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

answered Oct 17 '22 by Ole Tange


Alternatively, you can use the split utility, which is shipped with the GNU coreutils package (in contrast to parallel) and therefore has a better chance of being present on the target system. It can read STDIN (in addition to ordinary files), split by line count or by size, and apply custom logic to each chunk via the --filter CMD option. Please refer to the corresponding man page for usage details.

cat target | split -d -l10000 --suffix-length 5 --filter 'gzip > $FILE.gz' - prefix.

This will split STDIN into gzipped chunks of 10000 lines each, named prefix.<CHUNK_NUMBER>.gz, where <CHUNK_NUMBER> starts from 0 and is zero-padded to a length of 5 (e.g. 00000, 00001, 00002, etc.). The start number and an extra suffix can be set too. Note that an even line count like 10000 also ensures the 2-line records are never cut in half.
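
As a sketch of those two knobs (flags from GNU coreutils split; the names here are illustrative), this starts the numbering at 1 and inserts an extra suffix before the compression extension:

cat target | split -l10000 --numeric-suffixes=1 --suffix-length 5 --additional-suffix=.part --filter 'gzip > $FILE.gz' - prefix.

which yields prefix.00001.part.gz, prefix.00002.part.gz, and so on. If you would rather bound chunks by approximate size than by line count, split also has -C/--line-bytes, but be aware that it splits at line boundaries only, so a 2-line record could end up straddling two chunks.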

answered Oct 17 '22 by DimG