I have a diff that essentially amounts to either genuinely new lines or lines that have simply moved around in the file, and thus have changed line numbers. To identify what is truly a new addition, I run this little Perl snippet to separate the 'resolved' lines from the 'unresolved' lines:
perl -n -e'
  # "-" lines: strip the diff marker and the trailing (...,N,N) reference, send to STDOUT (resolved)
  /^\-([^\-].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDOUT "$1\n"; next; };
  # "+" lines: same, but send to STDERR (unresolved)
  /^\+([^\+].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDERR "$1\n"; next; };
' "$delta" 1>resolved 2>unresolved
This is in fact quite quick and does the job, separating a 6000+ line diff into two 3000+ line files and stripping the line-number references and unified-diff decoration. Next comes the grep command, which runs at 100% CPU for nearly 9 minutes (real time):
grep -v -f resolved unresolved
This removes every resolved line from the unresolved file. After those 9 minutes the output is, coincidentally, just 9 lines: the truly unique additions, i.e. the unresolved lines.
Firstly, when I have used grep in the past, it's been pretty good at this, so why in this case is it being exceptionally slow and CPU hungry?
Secondly, is there a more efficient alternative way of removing lines from one file that are contained within another?
If the lines to be matched across the two files are supposed to be exact matches, you can use sort and uniq to do the job:
cat resolved resolved unresolved | sort | uniq -u
The only non-duplicated lines in the pipeline above will be the lines in unresolved that are not in resolved. Note that it's important to specify resolved twice in the cat command; otherwise uniq will also pick out lines unique to that file. This assumes that resolved and unresolved didn't have duplicated lines to begin with, but that's easy to deal with: just sort and uniq each of them first:
sort resolved | uniq > resolved.uniq
sort unresolved | uniq > unresolved.uniq
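To see why resolved has to be listed twice, here is a tiny demonstration; the file contents are made up purely for illustration, only the file names match the commands above:

printf 'apple\nbanana\n'  > resolved.uniq    # already-resolved lines
printf 'banana\ncherry\n' > unresolved.uniq  # one resolved line, one genuinely new line

# With resolved listed only once, 'apple' also survives, because it appears just once in the combined stream:
cat resolved.uniq unresolved.uniq | sort | uniq -u
# apple
# cherry

# Listed twice, every resolved line is guaranteed to be a duplicate, so only the new line survives:
cat resolved.uniq resolved.uniq unresolved.uniq | sort | uniq -u
# cherry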
Also, I've found fgrep to be significantly faster if I'm trying to match fixed strings, so that might be an alternative.
Grep is probably trying each of the 3000+ patterns from resolved, as a separate regular expression, against every line of unresolved, which works out to millions of regex matches. You can try "fgrep" if it exists on your system, or grep -F if it doesn't, which forces grep to treat the patterns as fixed strings and use the Aho-Corasick string matching algorithm (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm); that matches all the strings simultaneously and needs only a single pass over the file.
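A minimal sketch using the questioner's file names; the -x variant is an extra suggestion on my part, restricting matches to whole lines rather than substrings, which is usually what you want when filtering one line list against another:

grep -F -v -f resolved unresolved       # fixed-string matching, pattern may match anywhere in a line
grep -F -x -v -f resolved unresolved    # fixed-string matching, whole-line matches only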