I am trying to parse large strings that I have loaded into memory from a file. I am parsing DNA sequences (stored as string) with a sliding window of variable length. The problem is that the strings are so large that it takes a very long time to iterate through them. I don't know if it is even possible but is it possible to speed this up somehow?
I mean I expected I/O to dominate my application so I shift my line by line reading to reading the whole file into memory at once, but after testing my code I found it spends most of the time in a loop like this:
size_t currentCharNumber = 0;
int16_t windowSize = 50;
//seq is a string of length 249250621
while(seq.length() - currentLinePos < windowSize)
{
string temp = seq.substr(currentLinePos, windowSize);
//do stuff to temp
++currentLinePos;
}
It only take seconds for loading the sequence into memory from a file but takes ~30 mins to parse the sequence (even after commenting out the processing below the substr() call). Is there something I'm missing that is adding a lot of overhead or is it probably due to the size of my data?
Would it be helpful to mention that I can ignore substrings with characters other that ATCG? I mean I do this filtering in my code but only after I get the strings from substr.
This is my first time posting, and my C++ is a little rusty. Any feedback would be greatly appreciated.
You might want to consider switching from using a string
to hold your sliding window to using a std::deque<char>
. The deque
type is optimized for inserting and removing values at either end, and is therefore a good candidate here. You can start by loading the first 50 characters into the deque
, and then can adjust your loop like this:
/* Initialize the window to the first windowSize characters. */
std::deque<char> window(seq.begin(), seq.begin() + windowSize);
/* Repeatedly process each window. */
for (size_t i = windowSize; i < seq.length(); ++i) {
/* Do something to window */
/* Drop the first character from the window, then add the next character
* of the sequence.
*/
window.pop_front();
window.push_back(seq[i]);
}
This makes the time to construct each window O(1) instead of O(k), where k
is the number of characters in the window. This could dramatically decrease the runtime, since the number of characters in your window is pretty large.
Hope this helps!
You can delimit your sliding window by two pointers into the original string and work with that instead of copying the entire range into a separate string. If std::string
construction is an overhead, avoid it.
You could also re-use the very same std::string
instance every time (assuming the window size is constant), but it would still cost you a copy operation (which may be negligible for small window/length ratios, though).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With