Obtain patterns in one file from another using ack or awk, or a better way than grep?

Is there a way to use ack to look up the patterns listed in one file in another file, the way the -f option works in grep? I see that ack also has an -f option, but it behaves differently from grep's -f.

Perhaps an example will give you a better idea. Suppose I have file1:

file1:
a
c
e

And file2:

file2:
a  1
b  2
c  3
d  4
e  5

I want to extract from file2 all the lines that match the patterns in file1, to give:

a  1
c  3
e  5

Can ack do this? If not, is there a better way to handle the job (such as awk, or using a hash)? I have millions of records in both files and really need an efficient way to get this done. Thanks!

Rock asked Mar 30 '12 04:03


3 Answers

Here's a Perl one-liner that uses a hash to hold the set of wanted keys from file1, giving amortized O(1) lookups for each line of file2. It therefore runs in O(m+n) time, where m is the number of lines in your key set and n is the number of lines in the file you're testing. (In the commands below, tkeys is the key file, i.e. file1.)

perl -ne'BEGIN{open K,shift@ARGV;chomp(@a=<K>);@hash{@a}=()}m/^(\p{alpha}+)\s/&&exists$hash{$1}&&print' tkeys file2

The key set will be held in memory while file2 is tested line by line against the keys.

Here's the same thing using Perl's -a (autosplit) command-line option, which splits each line on whitespace into @F:

perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' tkeys file2

The second version is probably a little easier on the eyes. ;)

One thing to remember is that you're more likely to be IO bound than processor bound, so the goal should be to minimize IO. Holding the entire lookup key set in a hash gives amortized O(1) lookups. The advantage over some (slower) solutions is that they have to run through your key file (file1) once for each line of file2. That sort of solution is O(m*n), where m is the size of your key file and n is the size of file2. The hash approach, on the other hand, takes O(m+n) time. With millions of records in both files, that is a difference of several orders of magnitude in the number of operations. It benefits by eliminating linear searches through the key set, and benefits further by reading the keys via IO only once.

DavidO answered Sep 21 '22 01:09


Well okay, if we've switched from comments to answers... ;-)

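Here's an awk one-liner that produces the same output as DavidO's Perl one-liner for the sample data. Note that it treats each line of file1 as a regexp and tests every one of them against each line of file2, so it does O(m*n) work rather than a single hash lookup per line. Awk is smaller and possibly leaner than Perl, but there are several implementations of awk, and I have no idea whether yours will perform better than others, or than Perl. You'll need to benchmark.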

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2

What does (should) this do?

The first part of the awk script matches only lines in file1 (where FNR, the record number in the current file, equals NR, the total record number), and populates the array. The second part (which runs on subsequent files) steps through each item in the array and checks whether it matches the current input line when used as a regexp.

The final part of the script is just the pattern "n", which was set to either 0 or 1 in the previous block. In awk, a nonzero value such as 1 evaluates as true, and a pattern with no curly-bracket block is treated as if it had {print}, so if the previous block found a match, the current line is printed.
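
Spelled out over several lines, the script reads roughly as follows. This is only a sketch: match_patterns is a hypothetical wrapper name, the variable names are changed for readability, and a break is added so the loop stops at the first match, which the one-liner does not do.

```shell
# match_patterns is a hypothetical wrapper name used only for this sketch.
# $1 = pattern file (one regexp per line), $2 = file to search
match_patterns() {
    awk '
        # First file: NR (global record count) equals FNR (per-file count),
        # so store the line as an array key and move on to the next record.
        NR == FNR { patterns[$0] = 1; next }

        # Later files: try every stored pattern as a regexp on this line.
        {
            found = 0
            for (p in patterns) {
                if ($0 ~ p) { found = 1; break }
            }
        }

        # A pattern with no action prints the line when the pattern is true.
        found
    ' "$1" "$2"
}

# Sample data from the question, written to a temporary directory.
dir=$(mktemp -d)
printf 'a\nc\ne\n' > "$dir/file1"
printf 'a  1\nb  2\nc  3\nd  4\ne  5\n' > "$dir/file2"

match_patterns "$dir/file1" "$dir/file2"

rm -rf "$dir"
```

Because each key is used as an unanchored regexp, a key such as a would also match a line like "banana  7". Anchor the patterns in file1 (for example ^a) if that matters for your data.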

If file1 contains plain strings instead of regexps, then you can make this run faster by replacing the regexp comparison if($0~i) with if(index($0,i))....
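
A minimal sketch of that fixed-string variant, assuming the keys should be matched literally anywhere in the line. The wrapper name match_fixed_strings and the sample keys are invented for the example, and the break is an addition that stops at the first hit.

```shell
# match_fixed_strings is a hypothetical wrapper name used only for this sketch.
# $1 = key file (one literal string per line), $2 = file to search
match_fixed_strings() {
    awk '
        # First file: remember each key.
        NR == FNR { keys[$0] = 1; next }

        # Later files: index() returns 0 when the key is not a substring,
        # and it never interprets regexp metacharacters.
        {
            found = 0
            for (k in keys) {
                if (index($0, k)) { found = 1; break }
            }
        }

        found
    ' "$1" "$2"
}

# A key containing a regexp metacharacter shows the difference.
dir=$(mktemp -d)
printf 'a.c\n' > "$dir/keys"
printf 'abc  1\na.c  2\n' > "$dir/data"

match_fixed_strings "$dir/keys" "$dir/data"    # prints only: a.c  2

rm -rf "$dir"
```

With the regexp comparison, the key a.c would match both lines, because . matches any character. With index() only the literal a.c line is printed.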

Use with caution. Your mileage may vary. Created in a facility that may contain nuts.

ghoti answered Sep 22 '22 01:09


nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4

This stores each line of file3 as an array key, then prints every line of file4 whose first field ($1) is one of those keys, so each line needs only one array lookup.

tested:

pearl.384> cat file3
a
c
e
pearl.385> cat file4
a  1 
b  2 
c  3 
d  4 
e  5
pearl.386> nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4
a  1 
c  3 
e  5
pearl.387>

Vijay answered Sep 22 '22 01:09