I'm loading a pretty gigantic file into a PostgreSQL database. To do this I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and it starts loading each file the moment split has finished writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man page and I can't find anything. Is there a way to do this with split or with any other tool?
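For reference, carga_postgres.sh is not shown in the question; a minimal sketch of such a loader, assuming a hypothetical database mydb, a hypothetical table datos_2011, and pipe-delimited .psv input, could be:

#! /bin/sh
# Hypothetical loader: copy one pipe-delimited chunk file into PostgreSQL.
# Database name, table name and delimiter are assumptions, not from the question.
psql -d mydb -c "\copy datos_2011 FROM '$1' WITH (FORMAT csv, DELIMITER '|')"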
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block over -N; this will still split the input at record separators (\n by default), e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
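Note that with --pipe each chunk arrives on the command's standard input rather than as a file name, so carga_postgres.sh would have to read from stdin in this setup. A minimal sketch of a stdin-reading loader, with the database and table names again being assumptions:

#! /bin/sh
# Hypothetical loader for the --pipe variant: stream the chunk from stdin into PostgreSQL.
psql -d mydb -c "COPY datos_2011 FROM STDIN WITH (FORMAT csv, DELIMITER '|')"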
Here's a test of --pipe and -N that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option:
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script which writes the chunk to a file and then starts carga_postgres.sh on it in the background:
#! /bin/sh
# Write the chunk arriving on stdin to the file name split provides in $FILE,
# then start the loader on that file in the background.
cat >"$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter:
split -l 50000000 --filter=./filter.sh 2011.psv
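One thing to watch with this filter is that it backgrounds a load for every chunk, so many copies of carga_postgres.sh can end up running at the same time. A hedged sketch of one way to cap the concurrency, using GNU Parallel's sem (an alias for parallel --semaphore) with an assumed limit of 4 simultaneous loads:

#! /bin/sh
# Alternative filter.sh: write the chunk, then queue the load so that at most
# 4 copies of the loader run at once (the limit of 4 is an assumption).
cat >"$FILE"
sem -j 4 ./carga_postgres.sh "$FILE"

After split has written the last chunk, running sem --wait blocks until all queued loads have finished.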