Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ Speeding up multiple substr() or equivalent function calls for parsing of a large string

I am trying to parse large strings that I have loaded into memory from a file. I am parsing DNA sequences (stored as string) with a sliding window of variable length. The problem is that the strings are so large that it takes a very long time to iterate through them. I don't know if it is even possible but is it possible to speed this up somehow?

I mean I expected I/O to dominate my application so I shift my line by line reading to reading the whole file into memory at once, but after testing my code I found it spends most of the time in a loop like this:

size_t currentCharNumber = 0;
int16_t windowSize = 50;
//seq is a string of length 249250621
while(seq.length() - currentLinePos < windowSize)
{
   string temp = seq.substr(currentLinePos, windowSize);
   //do stuff to temp
   ++currentLinePos;
}

It only take seconds for loading the sequence into memory from a file but takes ~30 mins to parse the sequence (even after commenting out the processing below the substr() call). Is there something I'm missing that is adding a lot of overhead or is it probably due to the size of my data?

Would it be helpful to mention that I can ignore substrings with characters other that ATCG? I mean I do this filtering in my code but only after I get the strings from substr.

This is my first time posting, and my C++ is a little rusty. Any feedback would be greatly appreciated.

like image 373
cjustin Avatar asked Dec 26 '22 18:12

cjustin


2 Answers

You might want to consider switching from using a string to hold your sliding window to using a std::deque<char>. The deque type is optimized for inserting and removing values at either end, and is therefore a good candidate here. You can start by loading the first 50 characters into the deque, and then can adjust your loop like this:

/* Initialize the window to the first windowSize characters. */
std::deque<char> window(seq.begin(), seq.begin() + windowSize);

/* Repeatedly process each window. */
for (size_t i = windowSize; i < seq.length(); ++i) {
    /* Do something to window */

    /* Drop the first character from the window, then add the next character
     * of the sequence.
     */
    window.pop_front();
    window.push_back(seq[i]);     
}

This makes the time to construct each window O(1) instead of O(k), where k is the number of characters in the window. This could dramatically decrease the runtime, since the number of characters in your window is pretty large.

Hope this helps!

like image 180
templatetypedef Avatar answered Dec 30 '22 11:12

templatetypedef


You can delimit your sliding window by two pointers into the original string and work with that instead of copying the entire range into a separate string. If std::string construction is an overhead, avoid it.

You could also re-use the very same std::string instance every time (assuming the window size is constant), but it would still cost you a copy operation (which may be negligible for small window/length ratios, though).

like image 41
Alexander Gessler Avatar answered Dec 30 '22 11:12

Alexander Gessler