
Matching text using grep or awk

Tags: grep, awk

I am having problems with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names and looks like this:

SNORD115-40
MIR432
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2

I want to match the ID names from my source file against my reference file and print the corresponding ENSG IDs, so that the output file looks like this:

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

I have tried this loop:

exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done

I've also tried playing around with the reference file using awk:

awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file

but I only get one of the grep'd IDs.

Any suggestions or easier ways of doing this would be great.

user1879573 asked Dec 11 '22 15:12


1 Answer

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
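
As a side note on the loop in the question: it keeps only one match because `>` truncates outputfile on every iteration, so only the last ID's result survives. A minimal sketch of a loop that works, assuming the sample data from the question and a throwaway temporary directory (the variable names `src`, `ref`, and `out` are illustrative), redirects once after `done`:

```shell
# Build the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
src="$tmp/source.file"
ref="$tmp/reference.file"
out="$tmp/outputfile"

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$src"
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000266661' \
  'ENSG00000243133' \
  'ENSG00000207447 RNU6-2' > "$ref"

# Read one ID per line; quote it so the shell does not split or glob it.
# The output redirection sits after `done`, so the file is opened once
# and every iteration's match is written into it.
while IFS= read -r line
do
    grep -Fw -- "$line" "$ref"
done < "$src" > "$out"

loop_out=$(cat "$out")
printf '%s\n' "$loop_out"

rm -rf "$tmp"
```

This still starts one grep process per ID, which is why the single `-f` invocation is usually preferable.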

fgrep is equivalent to grep -F:

   -F, --fixed-strings
          Interpret  PATTERN  as  a  list  of  fixed strings, separated by
          newlines, any of which is to be matched.  (-F  is  specified  by
          POSIX.)

The -f option is for taking PATTERN from a file:

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)

As noted in the comments, this can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build a stricter pattern for grep on the fly with sed, requiring a space before the ID and the end of the line after it:

grep -f <(sed 's/.*/ &$/' source.file) reference.file

But this way the patterns are interpreted as regular expressions rather than fixed strings, so an ID containing a regex metacharacter such as . or * could match unintended text (this may be OK if the IDs only contain alphanumeric characters). The better way, though (thanks to @sidharthcnadhan), is to use the -w option:

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.
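
The word rule quoted above can be checked on a small example. This is only a sketch: MIR4321 and the ENSG00000000001 ID are made up to trigger the substring case, and the file lives in a temporary directory.

```shell
# Throwaway reference file; MIR4321 is a hypothetical ID.
tmp=$(mktemp -d)
ref="$tmp/reference.file"
printf '%s\n' \
  'ENSG00000207793 MIR432' \
  'ENSG00000000001 MIR4321' \
  'ENSG00000207447 RNU6-2' > "$ref"

# Fixed-string match without -w: MIR432 is a substring of MIR4321,
# so both lines are selected.
plain=$(grep -F 'MIR432' "$ref")

# With -w, the "1" after MIR432 is a word character, so that line is dropped.
whole=$(grep -Fw 'MIR432' "$ref")

# Caveat: "-" is not a word character, so RNU6 still matches RNU6-2.
hyphen=$(grep -Fw 'RNU6' "$ref")

printf 'plain:\n%s\nwhole:\n%s\nhyphen:\n%s\n' "$plain" "$whole" "$hyphen"

rm -rf "$tmp"
```

The last case matters only if one ID can be a hyphen-delimited prefix of another; for the sample IDs in the question it does not arise.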

So the final answer to your question is:

grep -Fwf source.file reference.file
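
Since the question also mentions awk, here is a hedged alternative. With -w, an ID that is followed by a hyphen in the reference file still counts as a whole word, so a strict comparison of whole fields can be safer. The sketch below assumes the symbol is always the second whitespace-separated field of reference.file; it is not part of the original answer, and it runs on the sample data from the question in a temporary directory.

```shell
# Build the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
src="$tmp/source.file"
ref="$tmp/reference.file"

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$src"
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000266661' \
  'ENSG00000243133' \
  'ENSG00000207447 RNU6-2' > "$ref"

# The command from the answer, for comparison.
grep_out=$(grep -Fwf "$src" "$ref")

# awk: while reading the first file (NR == FNR), remember each ID as an
# array key. For the second file, print lines that have exactly two
# fields and whose second field is one of the remembered IDs.
awk_out=$(awk 'NR == FNR { ids[$1]; next } NF == 2 && ($2 in ids)' "$src" "$ref")

printf '%s\n' "$awk_out"

rm -rf "$tmp"
```

On this data both commands select the same three lines, in the order they appear in reference.file.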
Lev Levitsky answered Feb 16 '23 00:02