Awk: how to compare two strings in one line

Tags: bash, awk

I have a dataset of 20,000 probes arranged in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'

Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns against each other using a regex. All suggestions and comments will be very much appreciated.

Example of dataset:

        Probe 1                     Probe 2
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Desired output:

4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

asked Nov 27 '16 14:11

Bio21

1 Answer

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:

awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
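As a quick check, the filter can be run against the sample rows from the question. This is only a sketch: the file names probes.txt and matched.txt are assumptions, not names from the question. The header line is left in on purpose; its 2nd and 4th fields are "1" and "2", so it is filtered out like any other non-matching line.

```shell
# Recreate the sample data from the question (file name is an assumption).
cat > probes.txt <<'EOF'
        Probe 1                     Probe 2
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC
EOF

# Keep lines whose 2nd and 4th fields end in the same character.
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt > matched.txt

# Show the surviving lines.
cat matched.txt
```

On this input the rows 4736, 4737 and 4740 are kept, which matches the desired output in the question.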

What I changed compared to your sample script:

Moved the if condition out of the { ... } block and used it as the pattern (filter)
Used length($2) and length($4) instead of hardcoding the position 21
Dropped { print $0 }, since printing the line is awk's default action for lines that match
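To illustrate why length() is safer than a hardcoded position, here is a small sketch with two made-up rows (they are hypothetical, not taken from the dataset). In the first row the left probe is only 18 nt long, so position 21 is past its end and the hardcoded version misses a genuine match.

```shell
# Two hypothetical rows: row 1 pairs an 18-nt and a 21-nt probe that both
# end in A; row 2 pairs two 21-nt probes that end in different bases.
rows='1 GGAGGAAGAGGAGGCGGA X GGAGGACGAGGAGGAGGAGGA
2 GGAGGAAGAGGAGGCGGAGGA Y GGAGGACGAGGAGGAGGAGGT'

# Hardcoded position: substr($2, 21, 1) is empty for the 18-nt probe,
# so row 1 is not reported.
fixed=$(printf '%s\n' "$rows" |
  awk 'substr($2, 21, 1) == substr($4, 21, 1) { print $1 }')

# Length-based position: always compares the final character of each field.
by_length=$(printf '%s\n' "$rows" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1) { print $1 }')

echo "hardcoded 21: '$fixed'"
echo "length():     '$by_length'"
```

The hardcoded variant reports no rows here, while the length-based variant reports row 1. If every probe really is exactly 21 nt, both variants behave the same.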

answered Sep 28 '22 08:09

janos
