
grep not performing very well on large files, is there an alternative?

Tags: grep, sed, awk, perl

I have a diff that essentially consists of either genuinely new lines or lines that have simply moved around in the file, and whose line numbers have therefore changed. To identify what is truly a new addition, I run this little Perl snippet to separate the 'resolved' lines from the 'unresolved' lines:

perl -n -e'
    /^\-([^\-].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDOUT "$1\n"; next; };
    /^\+([^\+].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDERR "$1\n"; next; };
' "$delta" 1>resolved 2>unresolved

This is in fact quite quick and does the job, separating a 6000+ line diff into two 3000+ line files and stripping any references to line numbers and unified-diff decoration. Next comes the grep command, which seems to run at 100% CPU for nearly 9 minutes (real time):

grep -v -f resolved unresolved

This essentially removes all resolved lines from the unresolved file. After 9 minutes, the output is, coincidentally, just 9 lines: the unique additions, i.e. the truly unresolved lines.

Firstly, when I have used grep in the past, it's been pretty good at this, so why in this case is it being exceptionally slow and CPU hungry?

Secondly, is there a more efficient way of removing from one file the lines that are contained in another?

Asked Nov 05 '14 by Craig

2 Answers

If the lines to be matched across the two files are supposed to be exact matches, you can use sort and uniq to do the job:

cat resolved resolved unresolved | sort | uniq -u

The only non-duplicated lines surviving the pipeline above will be the lines of unresolved that are not in resolved. Note that it's important to list resolved twice in the cat command: otherwise uniq would also pick out lines that appear only in resolved. This assumes that resolved and unresolved didn't contain duplicate lines to begin with, but that's easy to deal with: just sort and uniq each of them first:

sort resolved | uniq > resolved.uniq
sort unresolved | uniq > unresolved.uniq
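
Putting the pieces together, a minimal sketch using the deduplicated files (resolved.uniq is still listed twice, and the output file name additions is just for illustration):

cat resolved.uniq resolved.uniq unresolved.uniq | sort | uniq -u > additions

Because every line of resolved.uniq now occurs at least twice in the stream, uniq -u can only ever emit lines that appear solely in unresolved.uniq.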

Also, I've found fgrep to be significantly faster if I'm trying to match fixed strings, so that might be an alternative.
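
For example, assuming the lines really are exact fixed-string matches, something like this might do the same job much faster (the -x flag, which requires whole-line matches, is my assumption; drop it if substring matches are intended):

grep -F -x -v -f resolved unresolved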

Answered Dec 09 '22 by RS239

grep is probably re-scanning the file for each and every pattern it has been told to match. You can try fgrep if it exists on your system, or grep -F if it doesn't; this forces grep to use the Aho-Corasick string matching algorithm (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm), which matches all the strings simultaneously and therefore needs only a single pass through the file.
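
Using the file names from the question, the invocation would simply be (fgrep and grep -F being equivalent):

fgrep -v -f resolved unresolved
grep -F -v -f resolved unresolved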

Answered Dec 09 '22 by Penny Perdition