Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

basic regex and string manipulation for DNA analysis using perl

Tags:

perl

I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.

Essentially, my file reads (file is in FASTA format):

>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC and then print the preceding DNA sequence.
So my output would look like this:

>LM1 
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

I have written the following program:

#!/usr/bin/perl

use strict; use warnings;

open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";

while(<FASTA>) {
    chomp($_);
    if ($_ =~  m/^>/ ) {
        my $header = $_;
        print "$header\n";
    }

    my $dna = <FASTA>;
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
        print "$dna";
    }

}
close(FASTA);

The problem is that my program reads the file line by line and the output I am receiving is the following:

>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning: Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.

If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.

like image 899
cebach561 Avatar asked Mar 04 '13 20:03

cebach561


2 Answers

Using the pos function:

use strict;
use warnings;

my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
  if (/^>/) {
    print;
  } else {
    if (/^[AGCT]/) {
      $dna .= $_;
    }
  }

}

if ($dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "\n";
}

__DATA__
>LM1

AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

You can process a file with multiple entries like so:

while (<DATA>) {
  if (/^>/) {
    if ($dna =~ /$seq/g) {
      print substr($dna, 0, pos($dna) - length($seq)), "\n";
      $dna = ""; 
    }   
    print;
  } elsif (/^[AGCT]/) {
    $dna .= $_; 
  }   
}

if ($dna && $dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
like image 65
perreal Avatar answered Nov 08 '22 12:11

perreal


Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>. So $dna = <FASTA> isn't doing what you think it is. It is reading more than you probably want it to.

while(<FASTA>) { #Reads a line here
  chomp($_);
  if ($_ =~  m/^>/ ) {
    my $header = $_;
    print "$header\n";
  }
  $dna = <FASTA> # reads another line here - Causes skips over every other line
}

Now, you need to read the sequence into your $dna. You can update your while loop with an else statement. So if its a head line, print it, else, we add it to $dna.

while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {
    # It is a header line, so print it
    my $header = $_;
    print "$header\n";
  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

After the loop, you can do your regex.

Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna variable will have all the sequences as one.

Edit: Adding simple a way to handle multiple sequences

my $dna = "";
while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {

    # Does $dna match the regex?
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
      print "$1\n";
    }

    # Reset the sequence
    $dna = "";

    # It is a header line, so print it
    my $header = $_;
    print "$header\n";

  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
  print "$1\n";
}
like image 2
Nate Avatar answered Nov 08 '22 12:11

Nate