Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to grep large number of files?

Tags:

grep

bash

I am trying to grep 40k files in the current directory and i am getting this error.

for i in $(cat A01/genes.txt); do grep $i *.kaks; done > A01/A01.result.txt
-bash: /usr/bin/grep: Argument list too long

How do one normally grep thousands of files?

Thanks Upendra

like image 535
upendra Avatar asked May 09 '14 19:05

upendra


1 Answers

This makes David sad...

Everyone so far is wrong (except for anubhava).

Shell scripting is not like any other programming language because much of the interpretation of lines comes from the power of the shell interpolating them before the command is actually executed.

Let's take something simple:

$ set -x
$ ls
+ ls
bar.txt foo.txt fubar.log
$ echo The text files are *.txt
echo The text files are *.txt
> echo The text files are bar.txt foo.txt
The text files are bar.txt foo.txt
$ set +x
$

The set -x allows you to see how the shell actually interpolates the glob and then passes that back to the command as input. The > points to the line that is actually being executed by the command.

You can see that the echo command isn't interpreting the *. Instead, the shell grabs the * and replaces it with the names of the matching files. Then and only then does the echo command actually executes the command.

When you have 40K plus files, and you do grep *, you're expanding that * to the names of those 40,000 plus files before grep even has a chance to execute, and that's where the error message /usr/bin/grep: Argument list too long is coming from.

Fortunately, Unix has a way around this dilemma:

$ find . -name "*.kaks" -type f -maxdepth 1 | xargs grep -f A01/genes.txt

The find . -name "*.kaks" -type f -maxdepth 1 will find all of your *.kaks files, and the -depth 1 will only include files in the current directory. The -type f makes sure you only pick up files and not a directory.

The find command pipes the names of the files into xargs and xargs will append the names of the file to the grep -f A01/genes.txtcommand. However, xargs has a trick up it sleeve. It knows how long the command line buffer is, and will execute the grep when the command line buffer is full, then pass in another series of file to the grep. This way, grep gets executed maybe three or ten times (depending upon the size of the command line buffer), and all of our files are used.

Unfortunately, xargs uses whitespace as a separator for the file names. If your files contain spaces or tabs, you'll have trouble with xargs. Fortunately, there's another fix:

$ find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt

The -print0 will cause find to print out the names of the files not separated by newlines, but by the NUL character. The -0 parameter for xargs tells xargs that the file separator isn't whitespace, but the NUL character. Thus, fixes the issue.

You could also do this too:

$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;

This will execute the grep for each and every file found instead of what xargs does and only runs grep for all the files it can stuff on the command line. The advantage of this is that it avoids shell interference entirely. However, it may or may not be less efficient.

What would be interesting is to experiment and see which one is more efficient. You can use time to see:

$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;

This will execute the command and then tell you how long it took. Try it with the -exec and with xargs and see which is faster. Let us know what you find.

like image 168
David W. Avatar answered Nov 15 '22 21:11

David W.