Awk: how to compare two strings in one line

Tags: bash, awk

I have a dataset with 20 000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'

Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to compare the probes in the two columns using a regex. All suggestions and comments would be very much appreciated.

Example of dataset:

        Probe 1                     Probe 2
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Desired output:

4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Bio21 asked Nov 27 '16 14:11



1 Answer

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:

awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'

What I changed compared to your sample script:

  • Moved the condition out of the { ... } block and used it as a pattern (filter)
  • Used length($2) and length($4) instead of hardcoding the value 21
  • Dropped { print $0 }, as printing the line is the default action for matched lines
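
To check the filter against the sample from the question, here is a minimal sketch. The file names probes.txt and matched.txt are hypothetical, and the sketch assumes the file contains only the data rows (no header line) separated by whitespace:

```shell
# probes.txt is a hypothetical file name; the rows are the sample data from the question.
cat > probes.txt <<'EOF'
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC
EOF

# Pattern-only awk program: a line is printed when the last character of
# field 2 equals the last character of field 4.
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt > matched.txt

# Show the rows that passed the filter.
cat matched.txt
```

With these sample rows the filter keeps IDs 4736, 4737 and 4740, which is the desired output in the question. If the real file starts with a header line, prefixing the pattern with NR > 1 && skips it.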

janos answered Sep 28 '22 08:09