
How fast is String.Substring relative to other methods of string processing?

I'm using VB.NET to process a long fixed-length record. The simplest option seems to be loading the whole record into a string and using Substring to access the fields by position and length. But it seems like there will be some redundant processing within the Substring method that happens on every single invocation. That led me to wonder whether I might get better results using a stream- or array-based approach.

The content starts out as a byte array containing UTF8 character data. A couple of other approaches I've thought of are listed below.

  1. Loading the string into a StringReader and reading blocks of it at a time
  2. Converting the byte array into a char array and accessing the characters positionally within the array
  3. (This one seems dumb but I'll throw it out there) Copying the byte array to a memory stream and using a StreamReader

This is definitely premature optimization; the substring approach may be perfectly acceptable even if it's a few milliseconds slower. But I thought I'd ask before coding it, just to see if anyone could think of a reason to use one of the other approaches.
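For what it's worth, a quick micro-benchmark would settle the question empirically. Here is a rough sketch comparing `Substring` against indexing into a char array (the record size, offsets, and iteration count are arbitrary; absolute timings will vary by machine and runtime):

```csharp
using System;
using System.Diagnostics;

class SubstringBench
{
    static void Main()
    {
        // Fake 400-character fixed-length record.
        string record = new string('x', 400);
        char[] chars = record.ToCharArray();
        const int iterations = 1_000_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Substring allocates a new string on every call.
            string field = record.Substring(100, 20);
        }
        sw.Stop();
        Console.WriteLine("Substring: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            // Building from the char array skips Substring's argument checks,
            // but still allocates a string for the field.
            string field = new string(chars, 100, 20);
        }
        sw.Stop();
        Console.WriteLine("char[]:    " + sw.ElapsedMilliseconds + " ms");
    }
}
```

Both paths end up allocating a string per field, so the difference is mostly the per-call overhead rather than the copy itself.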

asked Dec 29 '22 by John M Gant
2 Answers

The primary cost of Substring is copying the excised characters into a new string. Using Reflector you can see this:

private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}

Note that this is InternalSubString, not Substring() itself; to get there, the call first passes through five argument checks on the start index and length.

If you reference the same substring multiple times, it may well be worth extracting all the fields once and discarding the giant string, at the cost of the array overhead needed to store the extracted substrings.

If it's generally a "one off" access then Substring it, otherwise consider partitioning up. Perhaps System.Data.DataTable would be of use? If you're doing multiple accesses and parsing to other data types then DataTable looks more attractive to me. If you only need one record in memory at a time then a Dictionary<string,object> should be sufficient to hold one record (field names have to be unique).

Alternatively, you could write a custom, generic class that handles fixed-length record reading for you. Indicate the start index and type of each field. The length of each field is inferred from the start of the next field; the exception is the last field, whose length can be inferred from the total record length. The values can be auto-converted using the likes of int.Parse(), double.Parse(), bool.Parse(), etc.

RecordParser r = new RecordParser();
r.AddField("Name", 0, typeof(string));
r.AddField("Age", 48, typeof(int));
r.AddField("SystemId", 58, typeof(Guid));
r.RecordLength(80);

Dictionary<string, object> data = r.Parse(recordString);

If reflection suits your fancy:

[RecordLength(80)]
public class MyRecord
{
    [RecordFieldOffset(0)]
    public string Name { get; set; }

    [RecordFieldOffset(48)]
    public int Age { get; set; }

    [RecordFieldOffset(58)]
    public Guid SystemId { get; set; }
}

Simply run through the properties: PropertyInfo.PropertyType tells you how to convert each substring from the record, the attributes give you the offsets and total length, and you return an instance of your class with the data populated. Essentially, reflection pulls out the information needed to call RecordParser.AddField() and RecordLength() from the previous suggestion.
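A minimal sketch of that reflection pass might look like this. The attribute class, property shapes, and conversion logic here are assumptions for illustration, not a real API; lengths are inferred by sorting the offsets and taking the gap to the next field:

```csharp
using System;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Property)]
class RecordFieldOffsetAttribute : Attribute
{
    public int Offset { get; }
    public RecordFieldOffsetAttribute(int offset) { Offset = offset; }
}

static class RecordReflector
{
    public static T Parse<T>(string record) where T : new()
    {
        T result = new T();
        // Order properties by offset so each field's length is the gap to the next.
        var props = typeof(T).GetProperties()
            .Select(p => new { Prop = p, Attr = p.GetCustomAttribute<RecordFieldOffsetAttribute>() })
            .Where(x => x.Attr != null)
            .OrderBy(x => x.Attr.Offset)
            .ToArray();
        for (int i = 0; i < props.Length; i++)
        {
            int start = props[i].Attr.Offset;
            int end = (i + 1 < props.Length) ? props[i + 1].Attr.Offset : record.Length;
            string raw = record.Substring(start, end - start).Trim();
            // Convert the raw text to the property's declared type.
            Type t = props[i].Prop.PropertyType;
            object value = (t == typeof(Guid)) ? Guid.Parse(raw) : Convert.ChangeType(raw, t);
            props[i].Prop.SetValue(result, value);
        }
        return result;
    }
}
```

Guid gets special-cased because Convert.ChangeType does not handle it; a production version would also want to honor the record-length attribute and handle conversion failures.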

Then wrap it all up into a neat little, no-fuss class:

RecordParser<MyRecord> r = new RecordParser<MyRecord>();
MyRecord data = r.Parse(recordString);

You could even go so far as to call r.EnumerateFile("path\to\file") and use the yield return enumeration syntax to parse out records:

RecordParser<MyRecord> r = new RecordParser<MyRecord>();
foreach (MyRecord data in r.EnumerateFile("foo.dat"))
{
    // Do stuff with record
}

answered Mar 09 '23 by Colin Burnett

The fastest method will likely be the stream technique: assuming you can read each field sequentially, it keeps only what you need in memory and tracks where you are in the record as you go.
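Here is a sketch of that sequential approach, assuming the record arrives as a UTF-8 byte array (the field names and widths are made up for illustration):

```csharp
using System;
using System.IO;
using System.Text;

class StreamFieldReader
{
    static void Main()
    {
        // Simulate the incoming UTF-8 record: a 48-char name and a 10-char age.
        byte[] recordBytes = Encoding.UTF8.GetBytes("John Smith".PadRight(48) + "42".PadRight(10));

        using (var reader = new StreamReader(new MemoryStream(recordBytes), Encoding.UTF8))
        {
            // Read each field in order; the reader remembers the current position.
            string name = ReadField(reader, 48);
            string age = ReadField(reader, 10);
            Console.WriteLine(name.Trim() + ", " + age.Trim());
        }
    }

    static string ReadField(TextReader reader, int width)
    {
        var buffer = new char[width];
        int read = reader.ReadBlock(buffer, 0, width);
        return new string(buffer, 0, read);
    }
}
```

Note that widths here are in characters, not bytes; StreamReader handles the UTF-8 decoding, which matters if any field can contain multi-byte characters.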

answered Mar 08 '23 by Joel Coehoorn