
Shell: find files in a list under a directory

Tags: linux, bash, shell

I have a list of about 1000 file names to search for under a directory and its subdirectories. There are hundreds of subdirectories with more than 1,000,000 files. The following command runs find 1000 times:

cat filelist.txt | while IFS= read -r f; do find /dir -name "$f"; done

Is there a much faster way to do it?

Dagang asked Mar 31 '12 05:03



2 Answers

If filelist.txt has a single filename per line:

find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)

(The -f option tells grep to read its patterns from the given file, one per line, and to print every input line that matches any of them.)

Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):

The <( ... ) construct is called process substitution, and is somewhat similar to command substitution, $( ... ). The command above is equivalent to the following two commands (process substitution is neater and possibly a little faster):

sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt
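
As a rough illustration of that equivalence, the sketch below feeds a toy pattern list to grep -f both ways. The letters and the patterns.txt name are made up for the demo, and the process-substitution line is run under bash explicitly, on the assumption that bash is installed, because <( ... ) is not available in plain sh.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'b' 'c' > "$tmp/patterns.txt"

# Temporary-file form: grep reads its patterns from a regular file.
via_file=$(printf '%s\n' a b c | grep -f "$tmp/patterns.txt")

# Process-substitution form: bash hands grep a file name that is
# connected to the output of the inner command.
via_subst=$(bash -c 'printf "%s\n" a b c | grep -f <(printf "%s\n" b c)')

printf 'file form:\n%s\nprocess substitution:\n%s\n' "$via_file" "$via_subst"

# Show the file name that <( ... ) expands to (often a /dev/fd/N path).
bash -c 'echo <(true)'

rm -rf "$tmp"
```

Both forms should select the same lines; the only difference is whether the pattern file exists on disk.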

The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints the result. These commands convert the file names into patterns that grep will match correctly.

  • s@^@/@ means put a / before each filename. (The ^ means "start of line" in a regex.)
  • s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").

The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.

s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. Grep uses regexes, in which those characters are special, but we want them treated literally, so we need to escape them (if we didn't, a pattern like a.txt would also match a file named abtxt).
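
A quick way to see the effect of the escaping, using two made-up paths in place of real find output (a sketch, not part of the original command):

```shell
# Unescaped: the . matches any character, so ./abtxt is selected too.
unescaped=$(printf '%s\n' './abtxt' './a.txt' | grep '/a.txt$')

# Escaped: \. matches only a literal dot.
escaped=$(printf '%s\n' './abtxt' './a.txt' | grep '/a\.txt$')

printf 'unescaped:\n%s\nescaped:\n%s\n' "$unescaped" "$escaped"
```

The unescaped pattern selects both paths, while the escaped one selects only ./a.txt.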

As an example:

$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile

$ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$

Grep then uses each line of that output as a pattern when it is searching the output of find.
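
To see the whole pipeline at work, here is a minimal sketch on a throwaway directory. The tree, the file names and patterns.txt are invented for the demo, and the sed expression uses \|, which is a GNU extension, so this assumes GNU sed (or another sed that accepts it).

```shell
# Build a small test tree with one wanted name and several near misses.
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub"
touch "$tmp/dir/a.txt" "$tmp/dir/a.txt.backup" "$tmp/dir/abba.txt" \
      "$tmp/dir/sub/abtxt" "$tmp/dir/sub/a.txt"
printf '%s\n' 'a.txt' > "$tmp/filelist.txt"

# Same three sed commands as in the answer, written to a pattern file.
sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' "$tmp/filelist.txt" > "$tmp/patterns.txt"

# One find pass over the tree, filtered by all patterns at once.
matches=$(find "$tmp/dir" | grep -f "$tmp/patterns.txt" | sort)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Only dir/a.txt and dir/sub/a.txt should be listed; a.txt.backup, abba.txt and abtxt are rejected by the anchoring and escaping. File names containing newlines would still confuse this approach, since both find and grep work line by line here.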

huon answered Oct 14 '22 17:10



If filelist.txt is a plain list:

$ find /dir | grep -F -f filelist.txt

If filelist.txt is a pattern list:

$ find /dir | grep -f filelist.txt
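
Note that grep -F matches fixed strings anywhere in the path, so a.txt in the list also selects a.txt.backup. If exact file names are needed without building regexes, one possible alternative (not from either answer, just a sketch) is to compare the last path component in awk. The directory layout and names below are made up for the demo.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub"
touch "$tmp/dir/a.txt" "$tmp/dir/a.txt.backup" \
      "$tmp/dir/sub/lastfile" "$tmp/dir/sub/other"
printf '%s\n' 'a.txt' 'lastfile' > "$tmp/filelist.txt"

# Fixed-string matching: substring semantics, so a.txt.backup is included.
fixed=$(find "$tmp/dir" -type f | grep -F -f "$tmp/filelist.txt" | sort)

# awk variant: load the list into an array first, then read find's output
# from stdin (-) and keep paths whose last /-separated field is in the list.
exact=$(find "$tmp/dir" -type f |
        awk -F/ 'NR == FNR { names[$0]; next } $NF in names' "$tmp/filelist.txt" - |
        sort)

printf 'grep -F:\n%s\nawk exact match:\n%s\n' "$fixed" "$exact"
rm -rf "$tmp"
```

The awk lookup is a hash-table test per path, so it still needs only a single find pass over the tree. It assumes filelist.txt is not empty and that no file name contains a newline.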
jhoran answered Oct 14 '22 18:10
