The following code works as a minimal example. It searches a text (later a large DNA file) for a pattern while allowing one mismatch.
awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'
Eventually I also need the position at which each match is found, so the real awk command is more complex, along the lines of the solution here.
If I want to allow more mismatches in a longer string, I end up with very long regular expressions:
example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches "." allowed:
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
... and so on (actually 4060 possibilities)
/
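The figure of 4060 is the number of ways to choose 3 wildcard positions out of 30, i.e. the binomial coefficient C(30, 3). A minimal sketch to check the arithmetic, assuming a helper name (`count_patterns`) that is purely illustrative:

```shell
# count_patterns N K: number of alternatives needed for a pattern of
# length N with K wildcard positions, i.e. the binomial coefficient C(N, K).
# The function name is hypothetical, used only for this illustration.
count_patterns() {
  awk -v n="$1" -v k="$2" '
    function choose(n, k,    i, r) {
      r = 1
      for (i = 1; i <= k; i++)
        r = r * (n - k + i) / i   # stays an integer at every step
      return r
    }
    BEGIN { print choose(n, k) }'
}

count_patterns 4 1     # the minimal example: 4 alternatives
count_patterns 30 3    # the long example: 4060 alternatives
```

The count grows quickly with both the pattern length and the number of mismatches, which is why enumerating the alternatives does not scale.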
The problem with my solution is:
My question is:
As Jonathan Leffler points out in the comments, your issue in the first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, and you can solve it by putting your awk script in a file.
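As a minimal sketch of that workaround, the program (including its regex) can be written to a file and run with `awk -f`, so it never has to fit on the command line. The helper name `search_from_file` and the temporary file are assumptions made for illustration, and the text is the sample string from the question without its trailing ellipsis:

```shell
# Hypothetical helper: stores the awk program in a temporary file and runs
# it with -f, reading the text to search from standard input.
search_from_file() {
  prog=$(mktemp) || return 1
  cat > "$prog" <<'EOF'
# The regex lives in the program file, not in the argument list.
{ print match($0, /.ATC|A.TC|AA.C|AAT./) }
EOF
  awk -f "$prog"
  rm -f "$prog"
}

printf '%s\n' "CTGGGTCATTAAATCGTTAGC" | search_from_file   # prints 12
```

This only removes the argument-length limit; the regex itself is still as large as before.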
As he also points out, your fundamental approach is not optimal. Below are two alternatives.
Perl has many features that will aid you with this.
You can use the ^ (XOR) operator on two strings; the result contains \x00 wherever the strings match and some other character wherever they differ. March through the longer string, XORing each window against the shorter string and allowing up to a maximum number of substitutions, and there you are:
use strict;
use warnings;
use 5.014;
my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "AAAAAA";
my $max_subs = 3;
my $len_in = length $seq;
my $len_pat = length $pat;
my %posn;
sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}
for my $i ( 0 .. $len_in - $len_pat ) {
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}
say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;
Running this prints:
6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT
Substituting:
my $seq = "AAATCGAAAAGCDFAAAACGT";
my $pat = "AATC";
my $max_subs = 1;
Prints:
1 => AATC
8 => AAGC
15 => AAAC
It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.
You can also write a similar approach in awk:
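echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
# mc is the maximum number of mismatches; seq is the pattern to look for
{
    for (i = 1; i <= length($1) - length(seq) + 1; i++) {
        cnt = 0
        # count the positions where this window differs from the pattern
        for (j = 1; j <= length(seq); j++)
            if (substr($1, i + j - 1, 1) != substr(seq, j, 1)) cnt++
        # print the 0-based offset so the output lines up with the Perl version
        if (cnt <= mc) print i - 1 " => " substr($1, i, length(seq))
    }
}'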
Prints:
1 => AATC
8 => AAGC
15 => AAAC
And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.
(Be aware that the first character of a string is offset 1 in awk and offset 0 in Perl...)
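To make the offset difference concrete, here is a small sketch using awk's built-in `index()`. The helper name `first_hit` is hypothetical; it prints the 1-based awk position followed by the equivalent 0-based offset:

```shell
# first_hit TEXT PATTERN: print the awk (1-based) position of the first exact
# occurrence of PATTERN in TEXT, followed by the matching 0-based offset.
# The function name is illustrative only.
first_hit() {
  awk -v text="$1" -v pat="$2" 'BEGIN {
    pos = index(text, pat)   # awk counts from 1; 0 means "not found"
    print pos, pos - 1       # subtract 1 to get the 0-based offset
  }'
}

first_hit "AAATCGAAAAGCDFAAAACGT" "AATC"   # prints: 2 1
```

The second value corresponds to the `1 => AATC` line in the output above, which is why the awk version prints `i-1` rather than `i`.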