Print lines in one file matching patterns in another file

I have a file with more than 40,000 lines (file1), and I want to extract the lines that match the patterns listed in file2 (about 6,000 lines). I use grep like this, but it is very slow:

grep -f file2 file1 > out

Is there a faster way to do this using awk or sed?

Here are some extracts from my files:

File1:

scitn003869.2| scign003869 CGCATGTGTGCATGTATTATCGTATCCCTTG
scitn007747.1| scign007747  CACGCAGACGCAGTGGAGCATTCCAGGTCACAA
scitn003155.1| scign003155  TAAAAATCGTTAGCACTCGCTTGGTACACTAAC
scitn018252.1| scign018252  CGTGTGTGTGCATATGTGTGCATGCGTG
scitn004671.2| scign004671  TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT

File2:

scign000003
scign000004
scign000005
scign004671
scign000013

asked Jan 27 '14 18:01

2 Answers

Try grep -Fwf file2 file1 > out

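The -F option makes grep treat each pattern as a fixed string rather than a regular expression, so it should be faster because the regex engine is not needed. The -w option restricts matches to whole words, and -f reads the patterns from file2.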
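To see what the flags do, here is a small sketch that uses two lines from the question's File1. The scratch directory and the shortened ID `scign00315` are made up for illustration and are not part of the original files.

```shell
# Work in a throwaway directory (hypothetical; any scratch location works).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Two lines taken from the question's File1.
cat > file1 <<'EOF'
scitn003155.1| scign003155  TAAAAATCGTTAGCACTCGCTTGGTACACTAAC
scitn004671.2| scign004671  TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
EOF

# One real ID, plus a made-up ID that is a prefix of a real one.
cat > file2 <<'EOF'
scign004671
scign00315
EOF

# -F: fixed strings, -w: whole words only, -f: read patterns from file2.
grep -Fwf file2 file1 > out_words

# Without -w, the prefix scign00315 also matches inside scign003155.
grep -Ff file2 file1 > out_substr
```

With -w only the scign004671 line should be selected; without it, both lines are.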

answered Oct 16 '22 15:10

glenn jackman

Here's how to do it in awk:

awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1
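The one-liner packs two passes into one program. A commented, spread-out version may make that easier to follow; this is only a sketch on a two-line cut of the question's data, and the scratch directory is an assumption for illustration.

```shell
# Work in a throwaway directory (hypothetical; any scratch location works).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

cat > File1 <<'EOF'
scitn003869.2| scign003869 CGCATGTGTGCATGTATTATCGTATCCCTTG
scitn004671.2| scign004671  TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
EOF

cat > File2 <<'EOF'
scign000003
scign004671
EOF

awk '
    # First input (File2): NR equals FNR only while reading the first file.
    NR == FNR {
        pats[$0]    # referencing the element creates the key
        next        # skip the rule below for File2 lines
    }
    # Second input (File1): print lines whose 2nd field is a stored key.
    $2 in pats
' File2 File1 > out
```

Because the lookup is a hash test on `$2`, each line of File1 is checked once regardless of how many patterns File2 holds. One caveat: if File2 is empty, `NR == FNR` is also true while reading File1, so the idiom assumes a non-empty pattern file.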

Using a 60,000 line File1 (your File1 repeated 8000 times) and a 6,000 line File2 (yours repeated 1200 times):

$ time grep -Fwf File2 File1 > ou2

real    0m0.094s
user    0m0.031s
sys     0m0.062s

$ time awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1 > ou1

real    0m0.094s
user    0m0.015s
sys     0m0.077s

$ diff ou1 ou2

In other words, it's about as fast as the grep. One thing to note, though, is that the awk solution lets you pick a specific field to match on, so if anything from File2 shows up anywhere else in File1 you won't get a false match. It also matches on a whole field at a time, so if your target strings were of various lengths, "scign000003" would not match "scign0000031", for example (though grep's -w gives similar protection for that).
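The difference in false matches can be sketched as follows. The first File1 line is made up so that an ID from File2 appears in field 1 rather than field 2; the scratch directory is likewise an assumption.

```shell
# Work in a throwaway directory (hypothetical; any scratch location works).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Made-up first line: scign000003 appears in field 1, not field 2.
cat > File1 <<'EOF'
scign000003.1| scign777777  ACGTACGT
scitn004671.2| scign004671  TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
EOF

cat > File2 <<'EOF'
scign000003
scign004671
EOF

# grep looks for the word anywhere on the line, so it selects both lines.
grep -Fwf File2 File1 > by_grep

# awk compares only field 2, so it selects the second line alone.
awk 'NR==FNR{pats[$0]; next} $2 in pats' File2 File1 > by_awk
```

If every ID can only ever occur in the second column, the two commands select the same lines, which is what the empty diff above shows for the question's data.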

For completeness, here's the timing for the other awk solution posted elsethread:

$ time awk 'BEGIN{i=0}FNR==NR{a[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,a[j]))print $0}' File2 File1 > ou3

real    3m34.110s
user    3m30.850s
sys     0m1.263s

and here's the timing I get for the perl script Mark posted:

$ time ./go.pl > out2

real    0m0.203s
user    0m0.124s
sys     0m0.062s

answered Oct 16 '22 15:10

Ed Morton

Tags: grep, unix, sed, awk, extract

Jon
