The following script is for finding one motif in protein sequence.
use strict;
use warnings;
my @file_data=();
my $protein_seq='';
my $h= '[VLIM]';
my $s= '[AG]';
my $x= '[ARNDCEQGHILKMFPSTWYV]';
my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
my @locations=();
@file_data= get_file_data("seq.txt");
$protein_seq= extract_sequence(@file_data);
#searching for a motif hhhhDxxxxD in each protein sequence in the give file
foreach my $line(@file_data){
if ($motif=~ /$regexp/){
print "found motif \n\n";
} else {
print "not found \n\n";
}
}
#recording the location/position of motif to be outputed
@locations= match_position($regexp,$seq);
if (@locations){
print "Searching for motifs $regexp \n";
print "Catalytic site is at location:\n";
} else {
print "motif not found \n\n";
}
exit;
sub get_file_data{
my ($filename)=@_;
use strict;
use warnings;
my $sequence='';
foreach my $line(@fasta_file_data){
if ($line=~ /^\s*(#.*)?|^>/{
next;
}
else {
$sequence.=$line;
}
}
$sequence=~ s/\s//g;
return $sequence;
}
sub(match_positions) {
my ($regexp, $sequence)=@_;
use strict;
my @position=();
while ($sequence=~ /$regexp/ig){
push (@position, $-[0]);
}
return @position;
}
I am not sure how to extend this for finding multiple motifs (in a fixed order i.e motif1, motif2, motif3) in a given file containing a protein sequence.
A motif is a recurring narrative element with symbolic significance. If you spot a symbol, concept, or plot structure that surfaces repeatedly in the text, you're probably dealing with a motif. They must be related to the central idea of the work, and they always end up reinforcing the author's overall message.
Motif discovery is one of the sequence analysis problems under the application layer and it is one of the significant difficulties in bioinformatics applications. A DNA sequence motif is a subsequence of DNA sequence that is a short similar recurring pattern of nucleotides, and it has many biological functions 1.
Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF).
You could simply use alternations (delimited by |
) of the sequences. That way each sequence the regex engine can match it will.
/($h{4}D$x{4}D|$x{1,4}A{1,2}$s{2})/
Then you can test this match by looking at $1
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With