I have a large 74 GB file, file.xml, and need to grep a single regular expression against it as fast as possible. I'm trying to do this with GNU parallel:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -iF "test.*pattern" < file.xml

How can I implement this with --pipepart, since it's faster than --pipe? And does the search get faster by increasing or decreasing the block size (for example 20M instead of 10M, or 10M instead of 20M)?
1.) The largest XML file I have is 11 GB, so YMMV, but

parallel --pipepart LC_ALL=C grep -H -n 'searchterm' {} :::: file.xml

was faster than

parallel --pipe --block 10M --ungroup LC_ALL=C grep -iF "test.*pattern" < file.xml

and significantly faster than a plain grep 'searchterm' file.xml.
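As an aside on the --pipe command quoted above: it combines -F (fixed-string matching) with what looks like a regular expression, so grep will only match the literal text test.*pattern, not "test followed by anything followed by pattern". A small demonstration (demo.txt is a throwaway file created for illustration):

```shell
# With -F the pattern is a literal string; with -E it is a regex.
printf 'test123pattern\ntest.*pattern\n' > demo.txt
grep -cF 'test.*pattern' demo.txt   # counts only the literal line -> 1
grep -cE 'test.*pattern' demo.txt   # treats it as a regex -> 2
rm -f demo.txt
```

If you want the regex behaviour, drop -F (or use -E); if you really are searching for a fixed string, -F is the faster choice.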
2.) I didn't specify a block size for the parallel --pipepart command above, but you can with the --block option; you'll need to try different block sizes yourself to see whether they speed up or slow down the search. Using --block -1 (one block per jobslot) gave the fastest speed on my system for this approach.
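To make the block-splitting idea concrete, here is a minimal sketch of what --pipepart automates, using only coreutils: cut the file into line-aligned chunks, then grep each chunk concurrently. The file name demo.xml and the term 'searchterm' are placeholders for illustration only; this is not GNU parallel's actual implementation.

```shell
# Sketch of chunked parallel grep (the idea behind --pipepart blocks).
printf 'one searchterm\nnothing here\nsearchterm two\n' > demo.xml
split -n l/2 demo.xml part_               # 2 chunks, split on line boundaries (GNU split)
ls part_* | xargs -P 2 -I{} grep -H 'searchterm' {} > hits.txt
cat hits.txt                              # matches, tagged with their chunk name
rm -f part_* demo.xml
```

Larger blocks mean fewer, longer-running grep processes; smaller blocks mean more startup overhead but better load balancing, which is why the sweet spot is system-dependent and worth benchmarking.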
As @tshiono mentioned in the comments, try ripgrep: it was the fastest on my test XML file (quicker than grep, parallel + grep, or anything else I tried) and may prove to be a better solution for you overall.
EDIT

I tested @Ole Tange's suggested 'parallel + ripgrep' approach:

parallel --pipepart --block -1 LC_ALL=C rg 'Glu299SerfsTer21' {} :::: ClinVarFullRelease_00-latest.xml

and it was about the same speed as a plain rg 'Glu299SerfsTer21' ClinVarFullRelease_00-latest.xml on my system. The difference was negligible on my 11 GB file, but the 'parallel + rg' approach may still be best for a very large XML file like yours. There are a number of potential reasons I didn't see the expected speedup (e.g. @Gordon Davisson's suggestions in his comment above), but you would need to benchmark on your own system to figure out the best approach.

(Thanks Ole Tange for the suggestion and for creating such kick-ass software.)