
awk unix - match regex - regex string size limit | ideas?

Tags: regex, awk

The following code works as a minimal example. It searches a text (later a large DNA file) for a regular expression that allows one mismatch.

awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'

Later I am interested in the position where the regular expression is found, so the real awk command is more complex, like it is solved here.

If I want to allow more mismatches in a longer string, I end up with very long regular expressions:

example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches "." allowed:
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
-
- and so on. (actually 4060 possibilities)

/
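
To put a number on how quickly this alternation grows, here is a minimal sketch that counts the alternatives and the total regex length for the example above. It assumes every alternative contains exactly k "." wildcards (which also covers fewer mismatches, since "." matches the original letter too); the variable names n and k are only illustrative.

```shell
# n = pattern length, k = mismatches allowed (values from the example above)
summary=$(awk -v n=30 -v k=3 'BEGIN {
    # binomial coefficient C(n, k), built up so every intermediate value is an integer
    c = 1
    for (i = 1; i <= k; i++) c = c * (n - k + i) / i
    # each alternative is n characters long, and c alternatives need c - 1 "|" separators
    printf "%d alternatives, %d characters\n", c, c * n + c - 1
}')
echo "$summary"
```

Raising n or k makes the count grow combinatorially, which is why the regex gets so long so fast.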

The problem with my solution is:

  • awk does not accept a very long regex (the limit seems to be roughly 80,000 characters).
  • Error: "bash: /usr/bin/awk: Argument list too long"
  • Possible solution: SO-Link, but I couldn't find the solution there...

My question is:

  • Can I somehow still use the long regex?
    • Splitting the string and running the command multiple times could be a solution, but then I would get duplicated results.
  • Is there another way to approach this?
    • ("agrep" works, but it does not report the positions.)
Lucas asked May 10 '21 09:05



1 Answer

As Jonathan Leffler points out in the comments, the error in your first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, not from awk. You can avoid it by putting your awk script in a file and running it with awk -f.
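
A minimal sketch of that idea, assuming a temporary directory and the placeholder file names scan.awk and seq.txt (both made up for illustration). The sample sequence is only the visible part of the truncated string in the question, and the regex is the short one-mismatch example rather than the huge one:

```shell
# Hypothetical file names: scan.awk and seq.txt are placeholders for illustration.
tmpdir=$(mktemp -d)

# The program text lives in a file, so it never has to fit on the command line.
cat > "$tmpdir/scan.awk" <<'EOF'
{
    # match() returns the 1-based position of the first match, or 0 if there is none
    print match($0, /.ATC|A.TC|AA.C|AAT./)
}
EOF

# The data goes into its own file as well instead of being passed as an argument.
printf '%s\n' "CTGGGTCATTAAATCGTTAGC" > "$tmpdir/seq.txt"

# -f reads the awk program from the file; the sequence is read as normal input
result=$(awk -f "$tmpdir/scan.awk" "$tmpdir/seq.txt")
echo "$result"

rm -rf "$tmpdir"
```

This only removes the command-line length problem; whether a particular awk accepts a very large regex inside the script file is a separate question.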

As he also points out, your fundamental approach is not optimal. Below are two alternatives.


Perl has many features that will aid you with this.

You can use the ^ XOR operator on two strings: the result contains \x00 at every position where the strings match and some other character where they differ. Step through the longer string, XOR each window against the shorter string, and keep the windows with at most the allowed number of substitutions:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "AAAAAA";
my $max_subs = 3;

my $len_in  = length $seq;
my $len_pat = length $pat;
my %posn;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

for my $i ( 0 .. $len_in - $len_pat ) { 
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}

say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;

Running this prints:

6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT

Substituting:

my $seq      = "AAATCGAAAAGCDFAAAACGT";
my $pat      = "AATC";
my $max_subs = 1;

Prints:

1 => AATC
8 => AAGC
15 => AAAC

It is also easy (in the same style as awk) to convert this to read its input from either stdin or a file using Perl's 'magic' <> input.


You can also write a similar approach in awk:

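echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v pat="AATC" '
# mc = maximum number of mismatches, pat = pattern to look for
{
    n = length(pat)
    for (i = 1; i <= length($1) - n + 1; i++) {
        cnt = 0
        # count the positions where this window differs from the pattern
        for (j = 1; j <= n; j++)
            if (substr($1, i+j-1, 1) != substr(pat, j, 1)) cnt++
        # print a 0-based offset (i-1) so the output lines up with the Perl version
        if (cnt <= mc) print i-1 " => " substr($1, i, n)
    }
}'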

Prints:

1 => AATC
8 => AAGC
15 => AAAC

It also gives the same result as the Perl version for the longer example above. Since the input now comes from STDIN (or a file) and the regex does not need to be HUGE, this should get you started with either Perl or awk.

(Be aware that the first character of a string is at offset 1 in awk and offset 0 in Perl, which is why the awk version prints i-1.)
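
The offset difference is easy to check with a small sketch. It uses awk's built-in index() on the short test sequence from above and prints the awk position next to the equivalent 0-based offset; the variable names p and pos are only illustrative.

```shell
# index() and substr() in awk count from 1, so subtract 1 to report 0-based offsets
pos=$(echo "AAATCGAAAAGCDFAAAACGT" | awk '{
    p = index($0, "AATC")   # 1-based position of the first exact occurrence
    print p, p - 1          # awk position, then the equivalent 0-based offset
}')
echo "$pos"
```

The second number agrees with the "1 => AATC" line printed by both versions above.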

dawg answered Oct 13 '22 22:10