file.contain.query.txt
ENST001
ENST002
ENST003
file.to.search.in.txt
ENST001 90
ENST002 80
ENST004 50
Because ENST003 has no entry in 2nd file and ENST004 has no entry in 1st file the expected output is:
ENST001 90
ENST002 80
To grep multi query in a particular file we usually do the following:
grep -f file.contain.query <file.to.search.in >output.file
since I have like 10000 query and almost 100000 raw in file.to.search.in it takes very long time to finish (like 5 hours). Is there a fast alternative to grep -f ?
If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:
#!/usr/bin/env perl
use strict;
use warnings;
# build hash table of keys
my $keyring;
open KEYS, "< file.contain.query.txt";
while (<KEYS>) {
chomp $_;
$keyring->{$_} = 1;
}
close KEYS;
# look up key from each line of standard input
while (<STDIN>) {
chomp $_;
my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
if (defined $keyring->{$key}) { print "$_\n"; }
}
You'd use it like so:
lookup.pl < file.to.search.txt
A hash table can take a fair amount of memory, but searches are much faster (hash table lookups are in constant time), which is handy since you have 10-fold more keys to lookup than to store.
If you have fixed strings, use grep -F -f
. This is significantly faster than regex search.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With