awk unix - match regex - regex string size limit | ideas?

Tags:

regex

awk

The following code works as a minimal example. It searches a text (later a large DNA file) for a pattern while allowing one mismatch.

awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'

Later I am interested in the positions where the regular expression matches, so the awk command will become more complex, along the lines of the solution linked here.

If I want to allow more mismatches and search for a longer string, I end up with very long regular expressions:

Example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches (".") allowed:
/...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
[and so on]/

(actually 4060 possibilities)
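The figure of 4060 is the number of ways to choose 3 wildcard positions in the pattern, assuming the pattern is 30 characters long (which is what 4060 implies). A quick check, as a sketch; the variable names n and k are only for illustration:

```shell
# Count the ways to place k wildcard positions in a pattern of length n: C(n, k).
count=$(awk -v n=30 -v k=3 'BEGIN {
    c = 1
    # multiplicative form of the binomial coefficient; c stays an integer at each step
    for (i = 1; i <= k; i++) c = c * (n - k + i) / i
    print c
}')
echo "$count"
```

Each alternative is as long as the pattern itself, so the total regex length grows as C(n, k) times n, which is why this approach gets out of hand quickly.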

The problem with my solution is:

- A very long regex is not accepted by awk (the limit seems to be roughly 80,000 characters).
- Error: "bash: /usr/bin/awk: Argument list too long"
- Possible solution: SO-Link, but I could not find the solution there.

My question is:

Can I somehow still use the long regex?
- Splitting the string and running the command multiple times could be a solution, but then I would get duplicated results.
Is there another way to approach this?
- ("agrep" works, but it does not report the positions.)


asked May 10 '21 09:05

Lucas


1 Answer

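As Jonathan Leffler points out in the comments, the error in the first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, not from awk: the whole script is passed as a command-line argument and exceeds the allowed argument length. You can solve that by putting your awk script in a file and running it with awk -f.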
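To illustrate the script-in-a-file workaround, here is a minimal sketch, assuming a POSIX shell with mktemp; the temporary file and the variable names are only for illustration. The sequence is the visible part of the example from the question.

```shell
# Write the awk program to a file instead of passing it as an argument.
prog=$(mktemp)
cat > "$prog" <<'EOF'
{
    # match() returns the 1-based start of the leftmost match, or 0 if there is none
    print match($0, /.ATC|A.TC|AA.C|AAT./)
}
EOF

# The sequence arrives on stdin; the program text is read from the file via -f.
result=$(printf '%s\n' "CTGGGTCATTAAATCGTTAGC" | awk -f "$prog")
echo "$result"
rm -f "$prog"
```

Because the program is read from a file, its length no longer counts against the argument-list limit. Whether awk itself copes well with a regex that large is a separate question.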

As he also points out, your fundamental approach is not optimal. Below are two alternatives.

Perl has many features that will aid you with this.

You can use the ^ (XOR) operator on two strings: the result contains \x00 wherever the strings match and some other character wherever they differ. March through the longer string, XOR each window against the shorter string, and keep the window if the number of non-\x00 characters stays within a maximum substitution count:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "AAAAAA";
my $max_subs = 3;

my $len_in  = length $seq;
my $len_pat = length $pat;
my %posn;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

for my $i ( 0 .. $len_in - $len_pat ) { 
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}

say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;

Running this prints:

6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT

Substituting:

my $seq      = "AAATCGAAAAGCDFAAAACGT";
my $pat      = "AATC";
my $max_subs = 1;

Prints:

1 => AATC
8 => AAGC
15 => AAAC

It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.

You can also write a similar approach in awk:

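echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
# seq is the pattern to look for, mc the maximum number of mismatches
{
    for (i = 1; i <= length($1) - length(seq) + 1; i++) {
        cnt = 0
        # count the mismatches between the pattern and the window starting at i
        for (j = 1; j <= length(seq); j++)
            if (substr($1, i+j-1, 1) != substr(seq, j, 1)) cnt++
        # print the 0-based offset so the output matches the Perl version
        if (cnt <= mc) print i-1 " => " substr($1, i, length(seq))
    }
}'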

Prints:

1 => AATC
8 => AAGC
15 => AAAC

And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.

(Be aware that the first character of a string is at offset 1 in awk and at offset 0 in Perl. The awk script prints i-1 so that its positions match the Perl output.)
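Building on the awk loop above, here is a sketch of how it might be wrapped for stdin or file input with several sequences, one per line. The function name fuzzy_positions and the LINE:OFFSET output format are my own choices, not part of the original answer. Offsets are 0-based to match the outputs above.

```shell
# Hypothetical helper: fuzzy_positions PATTERN MAX_MISMATCHES [FILE...]
# Prints "LINE:OFFSET => WINDOW" for every window within the mismatch budget.
fuzzy_positions() {
    pat=$1
    mc=$2
    shift 2
    awk -v pat="$pat" -v mc="$mc" '
    {
        plen = length(pat)
        last = length($0) - plen + 1      # last 1-based window start on this line
        for (i = 1; i <= last; i++) {
            cnt = 0
            # stop comparing as soon as the mismatch budget is exceeded
            for (j = 1; j <= plen && cnt <= mc; j++)
                if (substr($0, i + j - 1, 1) != substr(pat, j, 1)) cnt++
            if (cnt <= mc) print NR ":" (i - 1) " => " substr($0, i, plen)
        }
    }' "$@"
}

# Usage: two sequences on stdin, pattern AATC, at most one mismatch
printf '%s\n' "AAATCGAAAAGCDFAAAACGT" "GGAATCGG" | fuzzy_positions AATC 1
```

With no file arguments the function reads stdin; with file names it reads those files instead. The early exit from the inner loop does not change the result, it only skips comparisons for windows that have already failed, which may help on a large DNA file.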

answered Oct 13 '22 22:10

dawg

