Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse without string split

Tags:

c#

parsing

This is a spin-off from the discussion in some other question.

Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.

The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.

I tried to make it old C-like way: compute the indices of the beginning and the end of substrings containing the numbers, and parse it "in place", without creating additional string. (See http://ideone.com/Op6h0, below shown the relevant part.)

int startIdx, endIdx = 0;
while(true)
{
    startIdx = endIdx;
    // no find_first_not_of in C#
    while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
    if (startIdx == s.Length) break;
    endIdx = s.IndexOf(' ', startIdx);
    if (endIdx == -1) endIdx = s.Length;
    // how to extract a double here?
}

There is an overload of string.IndexOf, searching only within a given substring, but I failed to find a method for parsing a double from substring, without actually extracting that substring first.

Does anyone have an idea?

like image 532
Vlad Avatar asked Apr 15 '12 11:04

Vlad


1 Answers

There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.

Anyway, you can save the allocation by creating a "buffer" string once of length 100 consisting of whitespace only. Then, for every string you want to parse, you copy the chars into this buffer string using unsafe code. You fill the buffer string with whitespace. And for parsing you can use NumberStyles.AllowTrailingWhite which will cause trailing whitespace to be ignored.

Getting a pointer to string is actually a fully supported operation:

    string l_pos = new string(' ', 100); //don't write to a shared string!
    unsafe 
    {
        fixed (char* l_pSrc = l_pos)
        {               
              // do some work
        }
    }

C# has special syntax to bind a string to a char*.

like image 194
usr Avatar answered Oct 19 '22 03:10

usr