I have a file with more than 40,000 lines (file1) and I want to extract the lines matching patterns in file2 (about 6,000 lines). I use grep like this, but it is very slow:
grep -f file2 file1 > out
Is there a faster way to do this using awk or sed?
Here are some extracts from my files:
File1:
scitn003869.2| scign003869 CGCATGTGTGCATGTATTATCGTATCCCTTG
scitn007747.1| scign007747 CACGCAGACGCAGTGGAGCATTCCAGGTCACAA
scitn003155.1| scign003155 TAAAAATCGTTAGCACTCGCTTGGTACACTAAC
scitn018252.1| scign018252 CGTGTGTGTGCATATGTGTGCATGCGTG
scitn004671.2| scign004671 TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
File2:
scign000003
scign000004
scign000005
scign004671
scign000013
Try grep -Fwf file2 file1 > out
The -F option specifies plain string matching, so it should be faster because it does not have to engage the regex engine.
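As a rough illustration of what each flag contributes, here is a small sketch that recreates the sample lines from the question in a scratch directory and runs the suggested command. The `tmpdir` and `grep_out` names are placeholders chosen for this example, not anything from the original post.

```shell
# Recreate the sample data from the question in a scratch directory.
tmpdir=$(mktemp -d)

cat > "$tmpdir/file1" <<'EOF'
scitn003869.2| scign003869 CGCATGTGTGCATGTATTATCGTATCCCTTG
scitn007747.1| scign007747 CACGCAGACGCAGTGGAGCATTCCAGGTCACAA
scitn003155.1| scign003155 TAAAAATCGTTAGCACTCGCTTGGTACACTAAC
scitn018252.1| scign018252 CGTGTGTGTGCATATGTGTGCATGCGTG
scitn004671.2| scign004671 TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
EOF

cat > "$tmpdir/file2" <<'EOF'
scign000003
scign000004
scign000005
scign004671
scign000013
EOF

# -F  treat every pattern as a fixed string, not a regular expression
# -w  only accept matches that form a whole word
# -f  read the patterns, one per line, from the named file
grep_out=$(grep -F -w -f "$tmpdir/file2" "$tmpdir/file1")
printf '%s\n' "$grep_out"

rm -rf "$tmpdir"
```

Only the scign004671 line is printed, since that is the only ID in the file2 extract that also occurs in the file1 extract.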
Here's how to do it in awk:
awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1
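For readers unfamiliar with the NR==FNR idiom, the same one-liner can be spread over several lines with comments. This is only an expanded sketch of the command above, run against two lines of each sample file; `tmpdir` and `awk_out` are placeholder names for this example.

```shell
tmpdir=$(mktemp -d)

# Two IDs to look for, and two data lines, taken from the question.
printf '%s\n' 'scign000003' 'scign004671' > "$tmpdir/File2"
printf '%s\n' \
    'scitn003155.1| scign003155 TAAAAATCGTTAGCACTCGCTTGGTACACTAAC' \
    'scitn004671.2| scign004671 TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT' \
    > "$tmpdir/File1"

awk_out=$(awk '
    # NR counts records across all input files, FNR restarts for each file,
    # so NR == FNR is true only while the first file (File2) is being read.
    NR == FNR {
        pats[$0]    # referencing the element creates the key; no value needed
        next        # skip the remaining rule for lines of File2
    }
    # Reached only for File1: print the line if field 2 is a stored key.
    $2 in pats
' "$tmpdir/File2" "$tmpdir/File1")

printf '%s\n' "$awk_out"
rm -rf "$tmpdir"
```

Because the lookup is a hash lookup on the array key rather than a scan over every pattern, the work per line of File1 stays roughly constant however many IDs File2 holds. Note that the idiom assumes File2 is not empty; with an empty first file, NR == FNR would also hold while File1 is read.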
Using a 60,000 line File1 (your File1 repeated 8000 times) and a 6,000 line File2 (yours repeated 1200 times):
$ time grep -Fwf File2 File1 > ou2
real 0m0.094s
user 0m0.031s
sys 0m0.062s
$ time awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1 > ou1
real 0m0.094s
user 0m0.015s
sys 0m0.077s
$ diff ou1 ou2
In other words, it's about as fast as the grep. One thing to note, though, is that the awk solution lets you pick a specific field to match on, so if anything from File2 shows up anywhere else in File1 you won't get a false match. It also matches on a whole field at a time, so if your target strings were of various lengths, "scign000003" would not match "scign0000031", for example (though the -w option for grep gives similar protection against that).
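The difference between the three matching behaviours can be seen on a tiny made-up data set. The three data lines below are hypothetical, invented for this sketch to trigger each case; they are not from the question's files, and the variable names are placeholders.

```shell
tmpdir=$(mktemp -d)

printf '%s\n' 'scign000003' > "$tmpdir/pats"

# Hypothetical lines, one per case:
#   1. a longer ID that merely contains the pattern
#   2. the pattern appears in field 1, not in field 2
#   3. the pattern is exactly field 2 (the match we actually want)
cat > "$tmpdir/data" <<'EOF'
idA| scign0000031 AAAA
scign000003| scign777777 CCCC
idC| scign000003 GGGG
EOF

# Plain substring matching: all three lines match.
plain_count=$(grep -F -c -f "$tmpdir/pats" "$tmpdir/data")

# Whole-word matching: line 1 is rejected, line 2 still matches.
word_count=$(grep -F -w -c -f "$tmpdir/pats" "$tmpdir/data")

# Field matching: only line 3, where field 2 equals the pattern.
field_count=$(awk 'NR==FNR{pats[$0]; next} $2 in pats {n++} END{print n+0}' \
    "$tmpdir/pats" "$tmpdir/data")

echo "grep -F: $plain_count, grep -Fw: $word_count, awk on field 2: $field_count"
rm -rf "$tmpdir"
```

This prints counts of 3, 2 and 1 respectively: -w removes the partial-ID match, but only the awk version also ignores an ID that turns up in a different column.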
For completeness, here's the timing for the other awk solution posted elsethread:
$ time awk 'BEGIN{i=0}FNR==NR{a[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,a[j]))print $0}' File2 File1 > ou3
real 3m34.110s
user 3m30.850s
sys 0m1.263s
and here's the timing I get for the perl script Mark posted:
$ time ./go.pl > out2
real 0m0.203s
user 0m0.124s
sys 0m0.062s