I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.
Essentially, my file reads (file is in FASTA format):
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC
and then print the preceding DNA sequence.
So my output would look like this:
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
I have written the following program:
#!/usr/bin/perl
use strict; use warnings;
open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
my $dna = <FASTA>;
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$dna";
}
}
close(FASTA);
The problem is that my program reads the file line by line and the output I am receiving is the following:
>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning: Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.
If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.
Using the pos function:
use strict;
use warnings;
my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
if (/^>/) {
print;
} else {
if (/^[AGCT]/) {
$dna .= $_;
}
}
}
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
__DATA__
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
You can process a file with multiple entries like so:
while (<DATA>) {
if (/^>/) {
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
$dna = "";
}
print;
} elsif (/^[AGCT]/) {
$dna .= $_;
}
}
if ($dna && $dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>
. So $dna = <FASTA>
isn't doing what you think it is. It is reading more than you probably want it to.
while(<FASTA>) { #Reads a line here
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
$dna = <FASTA> # reads another line here - Causes skips over every other line
}
Now, you need to read the sequence into your $dna
. You can update your while loop with an else
statement. So if its a head line, print it, else, we add it to $dna
.
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
After the loop, you can do your regex.
Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna
variable will have all the sequences as one.
Edit: Adding simple a way to handle multiple sequences
my $dna = "";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# Does $dna match the regex?
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
# Reset the sequence
$dna = "";
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With