I have a for loop which runs a Python script ~100 times on 100 different input folders. The Python script is most efficient on 2 cores, and I have 50 cores available, so I'd like to use GNU parallel to run the script on 25 folders at a time.
Here's my for loop (it works fine, but is sequential of course). The Python script takes a number of input arguments, including -p 2, which runs it on two cores:
for folder in $(find /home/rob/PartitionFinder/ -maxdepth 2 -type d); do
    python script.py --raxml --quick --no-ml-tree $folder --force -p 2
done
and here's my attempt to parallelise it, which does not work:
folders=$(find /home/rob/PartitionFinder/ -maxdepth 2 -type d)
echo $folders | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
The issue I'm hitting (perhaps it's just the first of many, though) is that my folders variable is not a list, so parallel ends up passing all 100 folders as one long string to {} rather than one folder per job.
All hints gratefully received.
Replace echo $folders | parallel ... with echo "$folders" | parallel ....
Without the double quotes, the shell performs word splitting on $folders and passes each folder as a separate argument to echo, which prints them all on a single line. With the quotes, the newlines from find are preserved, and parallel then provides each line as the argument to one job.
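Put together, the quoted version of the pipeline from the question would look something like this:

folders=$(find /home/rob/PartitionFinder/ -maxdepth 2 -type d)

# The quotes preserve the newlines from find, so parallel sees one folder
# per line. -P 25 runs 25 jobs at once; at -p 2 each, that fills all 50 cores.
echo "$folders" | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2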
To avoid such quoting issues altogether, it is always a good idea to pipe find to parallel directly, and use the null character as the delimiter:
find ... -print0 | parallel -0 ...
This works even for file names that contain spaces or newline characters.
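Applied to the command in the question, that pattern would look something like this (using parallel's -0/--null option, which makes it split input on null bytes):

# -print0 makes find separate results with null bytes instead of newlines,
# and -0 makes parallel split on those null bytes, so whitespace in folder
# names passes through intact.
find /home/rob/PartitionFinder/ -maxdepth 2 -type d -print0 |
    parallel -0 -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2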
You can pipe find directly to parallel:
find /home/rob/PartitionFinder/ -maxdepth 2 -type d | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
If you want to keep the string in $folders, you can pipe the echo to xargs to split it back into one folder per line (note that this relies on word splitting, so it will break if a folder name contains spaces):
echo $folders | xargs -n 1 | parallel -P 25 python script.py --raxml --quick --no-ml-tree {} --force -p 2
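As a quick illustration of what xargs -n 1 does here (with made-up folder names):

# echo joins the folders into one whitespace-separated line;
# xargs -n 1 re-emits them one per line, which is what parallel expects.
echo /a/one /a/two /a/three | xargs -n 1
# /a/one
# /a/two
# /a/three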