I've got a data-import script that reads lines and adds them to a database; so far, so good. Unfortunately, something in the script (or its runtime, or the database library, or whatever) leaks memory, so large imports use monotonically increasing main memory, leading to heavy swapping and then memory-exhausted process death. Breaking the import up into multiple runs is a workaround; I've been doing that with split, then running the import script in a loop on each piece.
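For reference, that workaround looks something like this (a sketch; the chunk prefix and the cleanup step are illustrative):

# split into 50,000-line pieces, import each piece, then clean up
split -l 50000 giantfile.txt chunk.
for f in chunk.*; do myimport < "$f"; done
rm chunk.*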
But I'd rather skip making the split files, and this feels like it should be a one-liner. In fact, it seems there should be an equivalent to xargs that passes the lines through to the specified command on stdin, instead of as arguments. If this hypothetical command were xlines, then I'd expect the following to run the myimport script for each batch of up to 50,000 lines in giantfile.txt:
cat giantfile.txt | xlines -L 50000 myimport
Am I missing an xlines-like capability under some other name, or hidden in some other command's options? Or can xlines be done in a few lines of bash script?
Use GNU Parallel (available from https://www.gnu.org/software/parallel/).
You will need the --pipe option and also the --block option (which takes a byte size, rather than a line count).
Something along the lines of:
cat giantfile.txt | parallel -j 8 --pipe --block 4000000 myimport
(That chooses a block size of 50,000 lines * 80 bytes = 4000000, assuming an average line length of about 80 bytes; it could also be abbreviated 4m here.)
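If you need an exact line count per batch rather than an approximate byte-sized block, GNU Parallel's --pipe mode can also split on a record count with -N; a sketch (check the man page of your installed version, as older releases may behave differently):

# with --pipe, -N is the number of records (lines, by default) per job
parallel --pipe -N 50000 myimport < giantfile.txt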
If you don't want the jobs to actually run in parallel, change the 8 to 1. Or you can leave it out altogether, and it will run one job per CPU core.
You can also avoid the cat by running:
parallel ... < giantfile.txt
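For example, the complete command above in redirect form (same flags as before, nothing new assumed):

parallel -j 8 --pipe --block 4m myimport < giantfile.txt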
My approach, without installing parallel and without writing temporary files:
#!/bin/bash
[ ! -f "$1" ] && echo "missing file." && exit 1

command="cat"  # just as an example; insert your command here
chunkSize=3    # just for the demo; set to 50000 in your version

# wc -l < file prints only the count, so no cut is needed
# (and quoting "$1" keeps filenames with spaces working)
totalLines=$(wc -l < "$1")

offset=1
while [ "$offset" -le "$totalLines" ]; do
    # skip to the current offset, take one chunk, pipe it to the command
    tail -n +"$offset" "$1" | head -n "$chunkSize" | $command
    offset=$((offset + chunkSize))
    echo "----"
done
Test:
seq 1000 1010 > testfile.txt
./splitter.sh testfile.txt
Output:
1000
1001
1002
----
1003
1004
1005
----
1006
1007
1008
----
1009
1010
----
This way, the solution remains portable and avoids temporary files. (Keep in mind that tail re-reads the file from the beginning for each chunk, so the total read cost grows with the number of chunks; for very large inputs the split-based workaround may end up faster.)
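As an aside, if GNU coreutils is available, split itself can stream each chunk straight into a command without ever creating the chunk files, via its --filter option (a sketch, assuming GNU split 8.13 or later):

# each 50,000-line chunk is piped to myimport's stdin; no files are written
split -l 50000 --filter='myimport' giantfile.txt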