
Extract all lines from a text file based on a given list of IDs

Tags: unix, awk, perl

I have 2 text files. file1 contains a list of IDs:

11002
10995
48981
79600

file2:

10993   item    0
11002   item    6
10995   item    7
79600   item    7
439481  item    5
272557  item    7
224325  item    7
84156   item    6
572546  item    7
693661  item    7
.....

I am trying to select all lines from file2 where the ID (first column) is in file1. Currently, I loop through the first file to build a regex like:

^\b11002\b\|^\b10995\b\|^\b48981\b\|^\b79600\b

Then run:

grep '^11002\|^10995\|^48981\|^79600' file2.txt

But when the number of IDs in file1 is too large (~2000), the regular expression becomes quite long and grep becomes slow. Is there another way? I am using Perl + Awk + Unix.

jjennifer asked Dec 04 '22 13:12


1 Answer

Use a hash table. It can be memory-intensive, but lookups take constant time on average. The script below is one efficient and correct way to do this, though not the only one: it reads file1 and stores each ID as a hash key, then reads file2 and looks up the first column of each line in the hash. If the key is in the hash table, the line is printed to standard output:

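#!/usr/bin/env perl

use strict;
use warnings;

# Load every ID from file1 into a hash so membership tests are fast.
open my $ids_fh, '<', 'file1' or die "could not open file1: $!\n";
my %wanted;
while (my $id = <$ids_fh>) {
    chomp $id;
    $wanted{$id} = 1;
}
close $ids_fh;

# Print each line of file2 whose first column is one of the loaded IDs.
open my $data_fh, '<', 'file2' or die "could not open file2: $!\n";
while (my $line = <$data_fh>) {
    # Split on any run of whitespace, so tabs and spaces both work.
    my ($test_key) = split ' ', $line;
    if (defined $test_key && exists $wanted{$test_key}) {
        print STDOUT $line;
    }
}
close $data_fh;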

There are lots of ways to do the same thing in Perl. That said, I value clarity and explicitness over fancy obscurity, because you never know when you will have to come back to a Perl script and make changes, and they are hard enough to manage as it is. One person's opinion.
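Since the question also mentions awk, the same hash-lookup idea can be sketched there as well. This is only an illustration under a few assumptions: the function name filter_by_ids is made up for the example, both files are taken to be whitespace-delimited with the ID in the first column, and the ID file is assumed to be non-empty (with an empty first file, the NR == FNR test would also be true while reading the second file).

```shell
#!/bin/sh
# filter_by_ids is a hypothetical helper name used only for this sketch.
# Usage: filter_by_ids IDS_FILE DATA_FILE
filter_by_ids() {
    # NR == FNR holds only while awk reads the first file (the ID list):
    # store each non-empty line's first field as an array key, then move on.
    # For the second file, print a line when its first field is a stored key.
    awk 'NR == FNR { if (NF) ids[$1]; next } $1 in ids' "$1" "$2"
}

# Demo on part of the sample data from the question, in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 11002 10995 48981 79600 > "$tmp/file1"
printf '%s\titem\t%s\n' 10993 0 11002 6 10995 7 79600 7 439481 5 > "$tmp/file2"
filter_by_ids "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

As in the Perl script, the IDs live in an associative array, so each line of file2 costs one lookup instead of a scan through a long alternation regex. The comparison is on the whole first field, so an ID such as 11002 does not match a longer ID that merely starts with the same digits.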

Alex Reynolds answered Dec 07 '22 02:12