
Parsing a file with D

Tags: parsing, d, dmd

I am new to D and would like to parse a biological (FASTA) file of the form

>name1
acgcgcagagatatagctagatcg
aagctctgctcgcgct
>name2
acgggggcttgctagctcgatagatcga
agctctctttctccttcttcttctagagaga
>name3
gag ggagag

so that I can capture the 'headers' name1, name2, name3 together with the corresponding 'sequence' data (the ..acgcg... stuff).

So far I have this, but it only iterates line by line and prints the header names:

import std.stdio;
import std.stream;
import std.regex;


int main(string[] args){
  auto filename = args[1];
  auto entry_name = regex(r"^>(.*)"); //captures header only
  auto fasta_regex = regex(r"(\>.+\n)([^\>]+\n)"); //captures header and corresponding sequence

  try {
    Stream file = new BufferedFile(filename);
    foreach(ulong n, char[] line; file) {
      auto name_capture = match(line,entry_name);
      writeln(name_capture.captures[1]);
    }

    file.close();
  }
  catch (FileException xy){
    writefln("Error reading the file: ");
  }

  catch (Exception xx){
    writefln("Exception occurred: " ~ xx.toString());
  }
  return 0;
}

I would like to know a nice way of extracting the header and the sequence data so that I can create an associative array in which each item corresponds to an entry in the file:

[name1:acgcgcagagatatagctagatcgaagctctgctcgcgct,name2:acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga,.....]
eastafri asked Jan 24 '12 19:01


2 Answers

The header is on its own line, right? So why not check for it and use an appender to accumulate the sequence for the value:

auto current = std.array.appender!(char[]);
string name;
foreach(ulong n, char[] line; file) {
      auto entry = match(line,entry_name);
      if(entry){//we are in a header line

          if(name){//store what was collected for the previous header
              map[name]=current.data.idup;//idup because current's buffer is reused and map holds strings
          }
          name = entry.captures[1].idup;//capture 1 is the name without the leading '>'
          current.clear();
      }else{
          current.put(line);
      }
}
map[name]=current.data.idup;//remember last capture

map is where you'll store the values (a string[string] will do)
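
For anyone trying this on a recent compiler: newer Phobos releases no longer ship std.stream, so here is a minimal sketch of the same header-then-appender idea using std.stdio's File.byLine instead. The function name readFasta and the haveName flag are my own additions for illustration, not part of the snippet above, and the sketch assumes Unix line endings.

```d
import std.array : appender;
import std.regex : matchFirst, regex;
import std.stdio : File, writeln;

/// Hypothetical helper: reads a FASTA file into a name -> sequence map.
string[string] readFasta(string path)
{
    auto headerRe = regex(r"^>(.*)");      // capture 1 is the name without '>'
    auto current = appender!(char[])();    // collects the sequence of the current entry
    string[string] map;
    string name;
    bool haveName = false;                 // assumption: data before the first header is dropped

    // byLine strips the '\n' and reuses its buffer, so copy anything that is kept.
    foreach (line; File(path).byLine)
    {
        auto m = matchFirst(line, headerRe);
        if (!m.empty)
        {
            if (haveName)
                map[name] = current.data.idup;   // store the previous entry
            name = m[1].idup;
            haveName = true;
            current.clear();
        }
        else if (haveName)
        {
            current.put(line);
        }
    }
    if (haveName)
        map[name] = current.data.idup;           // store the last entry
    return map;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: fasta <file>");
        return;
    }
    writeln(readFasta(args[1]));
}
```

Associative arrays are unordered, so the entries may print in a different order than they appear in the file.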

ratchet freak answered Oct 04 '22 12:10


Here is my solution without regular expressions (I do not believe we need regexps for such simple input):

import std.stdio;
import std.stream;

int main(string[] args) {
  int ret = 0;
  string fileName = args[1];
  string header;
  char[] sequence;
  string[string] content;
  try {  
    auto file = new BufferedFile(fileName);
    foreach(ulong lineNumber, char[] line; file) {
      if (line.length > 0 && line[0] == '>') { // length check avoids a range error on empty lines
        if (header.length > 0) {
          content[header] = sequence.idup;
          sequence.length = 0;
        } // if
        // we have a new header, and new sequence will start after it
        header = line[1..$].idup;
        content[header] = "";
      } else {
          sequence ~= line;
      } // else
    } // foreach
    content[header] = sequence.idup;
    file.close();
  }
  catch (OpenException oe){
    // writeln, so a '%' in the message is not treated as a format specifier
    writeln("Error opening file: ", oe.toString());
  }
  catch (Exception e){
    writeln("Exception: ", e.toString());
  }
  writeln(content);
  return ret;
} // main() function

/+ -------------------------- BEGIN OUTPUT ------------------------------- +
["name3":"gag ggagag", "name1":"acgcgcagagatatagctagatcgaagctctgctcgcgct", "name2":"acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga"]
 + -------------------------- END OUTPUT --------------------------------- +/
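
In case it helps, the same regex-free logic can also be written as a function over any range of lines, which makes it easy to try out without a file and does not depend on std.stream. This is only a sketch under my own assumptions: the name parseFasta, the inEntry flag, and the choice to skip blank lines and strip surrounding whitespace are not in the program above.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;
import std.string : lineSplitter, strip;

/// Hypothetical helper: builds a name -> sequence map from a range of lines.
string[string] parseFasta(Lines)(Lines lines)
{
    string[string] content;
    string header;
    char[] sequence;
    bool inEntry = false;              // true once the first header has been seen

    foreach (line; lines)
    {
        auto text = line.strip;        // drop surrounding whitespace, including '\r'
        if (text.length == 0)
            continue;                  // assumption: blank lines carry no data
        if (text.startsWith(">"))
        {
            if (inEntry)
                content[header] = sequence.idup;   // finish the previous entry
            header = text[1 .. $].idup;
            sequence.length = 0;
            inEntry = true;
        }
        else if (inEntry)
        {
            sequence ~= text;          // sequence lines are concatenated
        }
    }
    if (inEntry)
        content[header] = sequence.idup;           // finish the last entry
    return content;
}

void main()
{
    // Small in-memory sample in the same shape as the question's file.
    auto data = ">name1\nacgc\naagc\n>name2\nacgg\n";
    writeln(parseFasta(data.lineSplitter));
}
```

Because the function is a template over the line range, passing File(fileName).byLine from std.stdio should work as well as the lineSplitter range used here.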
DejanLekic answered Oct 04 '22 11:10