
How do I grep in parallel

Tags:

linux

grep

I usually use grep -rIn pattern_str big_source_code_dir to find things, but grep does not run in parallel. How do I make it parallel? My system has 4 cores, so if grep could use all of them, the search should be faster.

Lai Jiangshan asked Aug 10 '12 09:08



2 Answers

The GNU parallel command is really useful for this.

sudo apt-get install parallel # on Debian-based systems, if not already installed

The parallel man page provides an example:

EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 job per core, and give 1000 arguments to grep.

In your case it could be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
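Note that the command above drops the -I from the question's grep -rIn, so binary files are searched too, and it passes file names separated by newlines. A minimal sketch of a variant that keeps -I and uses NUL separators is below. It assumes GNU parallel (not the moreutils tool of the same name) is installed; par_grep is a hypothetical name, and the pattern is assumed to be a single shell-safe word like pattern_str, because parallel hands the command line to a shell.

```shell
#!/bin/sh
# par_grep PATTERN DIR
# Hypothetical wrapper: roughly "grep -rIn PATTERN DIR", spread over cores.
# Assumes GNU parallel; PATTERN must not contain spaces or shell
# metacharacters, since parallel passes the command through a shell.
par_grep() {
    pg_pattern=$1
    pg_dir=$2
    # -print0 / -0 : NUL-separated file names (safe with spaces and quotes)
    # -k           : keep output in input order
    # -j150%       : 1.5 jobs per core, as in the man page example
    # -n 1000 -m   : give each grep up to 1000 file arguments
    find "$pg_dir" -type f -print0 |
        parallel -0 -k -j150% -n 1000 -m grep -H -I -n -e "$pg_pattern" {}
}
```

Usage would be: par_grep pattern_str big_source_code_dir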

Finally, the GNU parallel man page also has a section describing the differences between xargs and parallel, which should help explain why parallel seems the better fit in your case:

DIFFERENCES BETWEEN xargs AND GNU Parallel
       xargs offer some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
       ...
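The special-character problem quoted above can be reproduced without deleting anything. This is a small sketch in which printf stands in for rm; the function names are hypothetical. It only shows how xargs splits one file name that contains a space, first with its default whitespace splitting and then with NUL-separated input and -0.

```shell
#!/bin/sh
# Show how xargs splits a single file name that contains a space.
# printf stands in for a real command, so no files are touched.

# Default splitting on whitespace: the name becomes two arguments.
split_default() {
    printf '%s\n' 'not important_file' | xargs -n 1 printf '[%s]\n'
}

# NUL-separated input read with -0: the name stays one argument.
split_nul() {
    printf '%s\0' 'not important_file' | xargs -0 -n 1 printf '[%s]\n'
}

split_default   # prints [not] and [important_file] on separate lines
split_nul       # prints [not important_file]
```

With find, the same effect comes from pairing -print0 with xargs -0.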
MordicusEtCubitus answered Nov 11 '22 02:11



There will be no speed improvement if the directory you are searching is stored on an HDD: the search is then limited by disk reads, and hard drives are pretty much single-threaded access units.

But if you really want to run grep in parallel, this website gives two hints on how to do it with find and xargs. For example:

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
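As a sketch only, the same idea can be wrapped in a function that takes the job count from the machine instead of hard-coding 4. The name xgrep is hypothetical. It adds -H so every match carries its file name even when a batch contains a single file, and -I and -n to mirror the question's grep -rIn. It assumes an xargs that supports -P (GNU and BSD do), and uses nproc from GNU coreutils with a fallback of 4.

```shell
#!/bin/sh
# xgrep PATTERN DIR [JOBS]
# Hypothetical wrapper around find + xargs -P.
# JOBS defaults to the number of cores reported by nproc, or 4 if
# nproc is not available.
xgrep() {
    xg_pattern=$1
    xg_dir=$2
    xg_jobs=${3:-$(nproc 2>/dev/null || echo 4)}
    # -print0 / -0 : NUL-separated names, safe with spaces and quotes
    # -P           : number of grep processes to run at once
    # -n 40        : at most 40 files per grep invocation
    # -H -I -n     : always print file name, skip binaries, show line numbers
    find "$xg_dir" -type f -print0 |
        xargs -0 -P "$xg_jobs" -n 40 grep -H -I -n -e "$xg_pattern"
}
```

Usage would be: xgrep pattern_str big_source_code_dir. Output order is not deterministic, and as the man page excerpt in the other answer notes, xargs does not group output, so lines from concurrent greps may interleave.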
Ilya answered Nov 11 '22 02:11
