Very slow loop using grep or fgrep on large datasets

Tags:

grep

bash

loops

I’m trying to do something pretty simple: grep the files in a directory for an exact match of each string in a list:

#try grep each line from the files
for i in $(cat /data/datafile); do 
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.

There are similar questions here:

Loop Running VERY Slow

Very slow foreach loop

Although they are on different platforms, I think possibly if/else might help me, or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

asked Jan 03 '13 by jksl

4 Answers

Sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)

So grep can already do what your loop is doing, and you can replace the loop with:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt

Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.

answered by Martin


As Martin has already said in his answer, you should use the -f option instead of looping; it should be much faster than the loop.

Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
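For instance, something along these lines (an untested sketch: it assumes GNU parallel and GNU grep, whose -f - reads the patterns from stdin; the block size and job count are placeholders you would need to tune):

# Feed /data/datafile to parallel in ~10 MB chunks of patterns and run
# one fixed-string grep per chunk, 8 at a time, over the whole directory.
parallel --pipepart -a /data/datafile --block 10M -j 8 \
    grep -F -f - -r /data/filestosearch >> /data/output.txt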

Other than that, 40 million lines should not be a very big deal for grep if there were only one string to match; it should be able to do it in a minute or two on any decent machine. I tested that 2 million lines take 6 s on my laptop, so 40 million lines should take about 2 minutes.

The problem is that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file, for example into chunks of 100,000 patterns each?
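Something like this might work (an untested sketch; the chunk size and the temporary file prefix are arbitrary choices):

# Split the pattern list into 100,000-line chunks, then run one
# fixed-string grep per chunk so each invocation needs less memory.
split -l 100000 /data/datafile /tmp/patterns.
for chunk in /tmp/patterns.*; do
    grep -F -r -f "$chunk" /data/filestosearch
done >> /data/output.txt
rm /tmp/patterns.*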

EDIT: I just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep across several cores and even several machines.

answered by Hari Menon


Here's one way to speed things up:

while read i
do
    LOOK=$(echo $i)
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile

When you do for i in $(cat /data/datafile), you first spawn another process, and that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the entries at the end.

By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.

If the $i are directories, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r:

while read i
do
    LOOK=$(echo $i)
    find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile

The xargs will take the output of find and run as many files as possible under a single fgrep. xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system) something like this:

find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt

On the Mac it's

find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt

As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.

Another possibility is to switch to Perl or Python, which will actually run faster than the shell and give you a bit more flexibility.

answered by David W.


Since you are searching for simple strings (and not regexps), you may want to use comm:

comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt

It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.

On an 8-core machine this takes 100 seconds on these files:

$ wc find_this; cat in_this.* | wc
3637371   4877980 307366868 find_this
16000000 20000000 1025893685

Be sure to have a reasonably new version of sort. It should support --parallel.
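For example (a sketch only; --parallel needs GNU coreutils sort 8.6 or newer, and the thread count is just a guess):

# Sort both sides with several threads before feeding them to comm.
comm -12 <(sort --parallel=8 find_this) <(sort --parallel=8 in_this.*) > /data/output.txt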

answered by Ole Tange