basic regex and string manipulation for DNA analysis using perl

Question

I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.

Essentially, my file reads (file is in FASTA format):

>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC and then print the preceding DNA sequence.
So my output would look like this:

>LM1 
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

I have written the following program:

#!/usr/bin/perl

use strict; use warnings;

open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.
";

while(<FASTA>) {
    chomp($_);
    if ($_ =~  m/^>/ ) {
        my $header = $_;
        print "$header
";
    }

    my $dna = <FASTA>;
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
        print "$dna";
    }

}
close(FASTA);

The problem is that my program reads the file line by line and the output I am receiving is the following:

>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning: Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.

If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.

perreal · Accepted Answer

Using the pos function:

use strict;
use warnings;

my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
  if (/^>/) {
    print;
  } else {
    if (/^[AGCT]/) {
      $dna .= $_;
    }
  }

}

if ($dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "
";
}

__DATA__
>LM1

AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

You can process a file with multiple entries like so:

while (<DATA>) {
  if (/^>/) {
    if ($dna =~ /$seq/g) {
      print substr($dna, 0, pos($dna) - length($seq)), "
";
      $dna = ""; 
    }   
    print;
  } elsif (/^[AGCT]/) {
    $dna .= $_; 
  }   
}

if ($dna && $dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "
";
}

Nate · Answer

Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>. So $dna = <FASTA> isn't doing what you think it is. It is reading more than you probably want it to.

while(<FASTA>) { #Reads a line here
  chomp($_);
  if ($_ =~  m/^>/ ) {
    my $header = $_;
    print "$header
";
  }
  $dna = <FASTA> # reads another line here - Causes skips over every other line
}

Now, you need to read the sequence into your $dna. You can update your while loop with an else statement. So if its a head line, print it, else, we add it to $dna.

while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {
    # It is a header line, so print it
    my $header = $_;
    print "$header
";
  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

After the loop, you can do your regex.

Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna variable will have all the sequences as one.

Edit: Adding simple a way to handle multiple sequences

my $dna = "";
while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {

    # Does $dna match the regex?
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
      print "$1
";
    }

    # Reset the sequence
    $dna = "";

    # It is a header line, so print it
    my $header = $_;
    print "$header
";

  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
  print "$1
";
}

basic regex and string manipulation for DNA analysis using perl

Tags:

perl

cebach561

2 Answers

perreal

Nate

Recent Activity

Donate For Us

basic regex and string manipulation for DNA analysis using perl

Tags:

perl

cebach561

2 Answers

perreal

Nate

Related questions

Recent Activity

Donate For Us