I have a large FASTA file (a genetic sequence for an entire chromosome) in which each line contains 50 characters (the bases a, g, t, and c). The file has about 4 million lines.
I want to reorganize the file so that each character is placed on its own line of a new file. That is, turn each 50-character line in the original file into 50 single-character lines, so the entire sequence is rewritten as a single column. Ultimately, I want the sequence as a single column so I can then place an adjacent column containing the genomic coordinate of each base.
This is how I am doing it, using Perl and a pair of nested for loops.
unless (@ARGV) {
    # $0 is the name of the program being executed
    print "\n usage: $0 filename\n\n";
    exit;
}

# use shift to pull the filename off @ARGV
my $fastafile = shift;

open( FASTA, "<$fastafile" );
my @count = (<FASTA>);
close FASTA;

for ( my $i = 0; $i < scalar @count; $i++ ) {
    my @seq = split( "", $count[$i] );
    print " line = $i ";
    for ( my $j = 0; $j < scalar @seq; $j++ ) {
        print "$seq[$j] for count = $j \n";
    }
}
It seems to work, but it is very slow. I am wondering whether it is slow because the FASTA file has 4 million lines, because of my code, or both. I am looking for advice to speed up this process. Thanks!
The problem is that you are slurping the whole file into memory. While the huge file is being slurped, the process waits until all the I/O is over before it starts processing, and building a 4-million-element array is itself expensive. One option is to process the file line by line:
open my $fh, '<', $fastafile or die "Error opening file: $!";

while ( my $line = <$fh> ) {
    chomp $line;    # remove the newline from the end of each line
    my @seq = split //, $line;

    # loop from 0 to the last index of @seq
    for my $i ( 0 .. $#seq ) {
        print "$seq[$i] for count = $i\n";
    }
}
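Since the question's ultimate goal is one base per line with an adjacent genomic-coordinate column, here is a minimal sketch that extends the same line-by-line approach with a running position counter. It assumes a simple 1-based coordinate counted from the start of the file and tab-separated output; adjust the starting offset and separator to match your actual coordinate system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One base per output line, with an adjacent column holding its
# genomic coordinate. Assumption: the coordinate is a 1-based count
# from the start of the sequence.
my $fastafile = shift or die "usage: $0 filename\n";
open my $fh, '<', $fastafile or die "Error opening file: $!";

my $pos = 0;
while ( my $line = <$fh> ) {
    chomp $line;
    for my $base ( split //, $line ) {
        $pos++;
        print "$base\t$pos\n";    # base, then its coordinate
    }
}
close $fh;
```

Whichever version you run, redirect the output to a file (e.g. `perl script.pl chr1.fa > chr1_column.txt`): with 4 million 50-character lines you are printing about 200 million output lines, and writing those to a terminal is itself a major source of slowness.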