I have 2 text files. file1 contains a list of IDs:
11002
10995
48981
79600
file2 contains:
10993 item 0
11002 item 6
10995 item 7
79600 item 7
439481 item 5
272557 item 7
224325 item 7
84156 item 6
572546 item 7
693661 item 7
.....
I am trying to select all lines from file2 where the ID (first column) is in file1. Currently, I loop through the first file to build a regex like:
^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b
Then run:
grep '^11002\|^10995\|^48981\|^79600' file2
But when the number of IDs in file1 is large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.
Use a hash table. It can be memory-intensive, but lookups take constant time. An efficient and straightforward approach is to build a hash table keyed on the IDs in file1, then read file2 and look up each line's first field in that table. If the key is present, the line is printed to standard output:
#!/usr/bin/env perl
use strict;
use warnings;

# Build a hash of the IDs in file1; hash lookups take constant time.
open my $fh1, '<', 'file1' or die "could not open file1: $!\n";
my %ids;
while (<$fh1>) {
    chomp;
    $ids{$_} = 1;
}
close $fh1;

# Scan file2 and print every line whose first field is a known ID.
open my $fh2, '<', 'file2' or die "could not open file2: $!\n";
while (my $line = <$fh2>) {
    chomp $line;
    # split ' ' splits on any run of whitespace, so this works
    # whether file2 is tab- or space-delimited
    my ($key) = split ' ', $line;
    print "$line\n" if exists $ids{$key};
}
close $fh2;
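To run it, save the script next to file1 and file2 (the file names are hardcoded above) and redirect standard output; the script name filter_ids.pl here is just an example:
perl filter_ids.pl > matched.txt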
There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and Perl scripts are hard enough to maintain as it is. One person's opinion.
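Since the question mentions awk as well, the same hash-lookup idea fits in a one-liner; this is a sketch using the file names from the question. While NR==FNR (i.e., while reading file1), it stores each ID as an array key; after that, it prints each line of file2 whose first field is a stored key:
awk 'NR==FNR { ids[$1] = 1; next } $1 in ids' file1 file2
grep -F -f file1 file2 would also avoid the slow alternation regex, but it cannot restrict matches to the first column the way the awk version does.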