Recommended way to batch stdin lines to another repeated command, like xargs but via stdin rather than arguments?

I've got a data-import script that reads lines and adds them to a database; so far so good. Unfortunately something in the script (or its runtime or database library or whatever) leaks memory, so large imports use monotonically increasing main memory, leading to swap thrashing and eventually process death from memory exhaustion. Breaking the import up into multiple runs is a workaround; I've been doing that with split and then a loop that runs the import script on each piece, roughly as sketched below.
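
For reference, my current workaround looks roughly like this (the chunk_ prefix and the chunk size are just placeholders):

split -l 50000 giantfile.txt chunk_
for f in chunk_*; do
    myimport < "$f"
done
rm chunk_*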

But I'd rather skip making the split files, and this feels like it should be a one-liner. In fact, it seems there should be an equivalent to xargs that passes the lines through to the specified command on stdin instead of as arguments. If this hypothetical command were xlines, then I'd expect the following to run the myimport script for each batch of up to 50,000 lines in giantfile.txt:

cat giantfile.txt | xlines -L 50000 myimport

Am I missing an xlines-like capability under some other name, or hidden in some other command's options? Or can xlines be written in a few lines of bash?

asked Jan 10 '23 by gojomo

2 Answers

Use GNU Parallel, available from https://www.gnu.org/software/parallel/.

You will need the --pipe option and also the --block option (which takes a byte size, rather than a line count).

Something along the lines of:

cat giantfile.txt | parallel -j 8 --pipe --block 4000000 myimport

(That's choosing a block size of 50,000 lines * an assumed 80 bytes per line = 4000000 bytes, which could also be abbreviated to 4m here.)

If you don't want the jobs to actually run in parallel, change the 8 to 1. Or, you can leave it out altogether and it will run one job per CPU core.

You can also avoid the cat by running:

parallel ... < giantfile.txt
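
If an exact line count per batch matters more than raw throughput, my understanding is that --pipe can also be combined with -N, which reads a fixed number of records per job instead of a byte-sized block (somewhat slower than --block):

parallel --pipe -N 50000 myimport < giantfile.txt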
answered Jan 12 '23 by Mark Setchell

My approach, without installing parallel, and without writing temporary files:

#!/bin/bash

[ ! -f "$1" ] && echo "missing file." && exit 1

command="$(which cat)"     # just as an example, insert your command here
totalSize=$(wc -l < "$1")  # total number of lines in the input file
chunkSize=3                # just for the demo, set to 50000 in your version
offset=1

while [ "$offset" -le "$totalSize" ]; do

        # feed lines offset .. offset+chunkSize-1 to the command's stdin
        tail -n +"$offset" "$1" | head -n "$chunkSize" | $command
        offset=$((offset + chunkSize))
        echo "----"
done

Test:

seq 1000 1010 > testfile.txt
./splitter.sh testfile.txt

Output:

1000
1001
1002
----
1003
1004
1005
----
1006
1007
1008
----
1009
1010
----

This way, the solution remains portable and avoids the I/O of writing temporary files; note, though, that tail re-reads the file from the beginning for each chunk, so very large inputs pay for repeated scans.
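
If those repeated scans become a problem, a single-pass sketch of the same batching idea is possible with awk, which re-runs the command every time the pipe to it is closed (cmd and the 50000 here are just stand-ins for your command and batch size):

awk -v n=50000 -v cmd=myimport '
        { print | cmd }                  # stream each input line to the command
        NR % n == 0 { close(cmd) }       # after n lines, close the pipe; the next print restarts cmd
        END { if (NR % n) close(cmd) }   # flush a final partial batch, if any
' giantfile.txt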

answered Jan 12 '23 by lxg