 

split STDIN to multiple files (and compress them if possible)

I have a program (gawk) that outputs a stream of data to its STDOUT. The data processed amounts to tens of GBs. I don't want to persist it in a single file but rather split it into chunks, and potentially apply some extra processing (like compression) to each chunk before saving.

My data is a sequence of records and I don't want the splitting to cut a record in half. Each record matches the following regexp:

^\{index.+?\}\}\n\{.+?\}$

or, for simplicity, you can assume that two rows (first odd, then even, when numbering from the beginning of the stream) always make a record.
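
For illustration, a matching record could look like this (hypothetical data; the actual payload will differ):

{index: {_index: "test", _id: 1}}
{field1: "value1", field2: "value2"}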

Can I:

  • use some standard Linux command to split STDIN by defining a preferred chunk size? It doesn't need to be exact, since the variable record size can't guarantee that. Alternatively, just a number of records, if defining by size is impossible
  • compress each chunk and store it in a file (with some numbering in its name, like 001, 002, etc.)?

I've become aware of commands like GNU parallel or csplit, but I don't know how to put them together. It would be nice if the functionality explained above could be achieved without writing a custom Perl script for it. That could, however, be another, last-resort solution, but again, I'm not sure how best to implement it.

asked Mar 25 '14 by msciwoj




2 Answers

GNU Parallel can split stdin into chunks of records. The following will split stdin into 50 MB chunks, with each record being 2 lines. Each chunk is passed to gzip and compressed into a file named [chunk number].gz:

cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz
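
Note that the ">" is quoted so the redirection is interpreted by the shell GNU parallel spawns for each chunk rather than by your current shell, and {#} is GNU parallel's job sequence number, so the chunks come out as 1.gz, 2.gz, and so on.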

If you know your second line will never start with '{index', you can use '{index' as the record start:

cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz

You can then easily test whether the splitting went correctly:

parallel zcat {} \| wc -l ::: *.gz

Unless your records are all the same length, you will probably see a different number of lines per file, but they will all be even.
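
If you later need to reconstruct the original stream, a minimal sketch (assuming the 1.gz, 2.gz, ... names produced by {#} above):

ls *.gz | sort -n | xargs zcat > restored

Here sort -n orders the files by their leading chunk number, so 10.gz comes after 9.gz rather than after 1.gz.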

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

answered Oct 17 '22 by Ole Tange


Alternatively, you can use the split utility, which is shipped with the GNU coreutils package (in contrast to parallel) and therefore has a better chance of being present on the target system. It can read STDIN (in addition to ordinary files), split by line count or by size, and apply custom logic to each chunk via the --filter CMD option. Please refer to the corresponding man page for usage details.

cat target | split -d -l10000 --suffix-length 5 --filter 'gzip > $FILE.gz' - prefix.

This will split STDIN into gzipped chunks of 10000 lines each, named prefix.<CHUNK_NUMBER>.gz, where <CHUNK_NUMBER> starts from 0 and is zero-padded to a length of 5 (e.g. 00000, 00001, 00002, etc.). The start number and an extra suffix can be set too. Note that an even line count like 10000 also ensures the 2-line records are never cut in half.
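
As a sketch of those two knobs (flags from GNU coreutils split; the names here are illustrative), this starts the numbering at 1 and inserts an extra suffix before the compression extension:

cat target | split -l10000 --numeric-suffixes=1 --suffix-length 5 --additional-suffix=.part --filter 'gzip > $FILE.gz' - prefix.

which yields prefix.00001.part.gz, prefix.00002.part.gz, and so on. If you would rather bound chunks by approximate size than by line count, split also has -C/--line-bytes, but be aware that it splits at line boundaries only, so a 2-line record could end up straddling two chunks.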

answered Oct 17 '22 by DimG