I'm using VB.NET to process a long fixed-length record. The simplest option seems to be loading the whole record into a string and using Substring to access the fields by position and length. But it seems like there will be some redundant processing within the Substring method that happens on every single invocation. That led me to wonder whether I might get better results using a stream- or array-based approach.
The content starts out as a byte array containing UTF-8 character data. A couple of other approaches I've considered are reading the fields sequentially from a stream, or indexing directly into the byte array.
This is definitely premature optimization; the substring approach may be perfectly acceptable even if it's a few milliseconds slower. But I thought I'd ask before coding it, just to see if anyone could think of a reason to use one of the other approaches.
The primary cost with Substring is excising the substring into a new string: an allocation plus a copy on every call. Using Reflector you can see this:
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
Now, to get there (notice that this is InternalSubString, not Substring() itself), the call has to pass through five or so checks on startIndex, length, and such.
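For context, the public Substring(int, int) overload that delegates to it looks roughly like this. This is a paraphrase from memory rather than exact Reflector output; the precise checks and exception messages vary by framework version:

public string Substring(int startIndex, int length)
{
    // All of this argument validation runs on every call,
    // even when the inputs are always valid.
    if (startIndex < 0)
        throw new ArgumentOutOfRangeException("startIndex");
    if (startIndex > this.Length)
        throw new ArgumentOutOfRangeException("startIndex");
    if (length < 0)
        throw new ArgumentOutOfRangeException("length");
    if (startIndex > this.Length - length)
        throw new ArgumentOutOfRangeException("length");
    return this.InternalSubString(startIndex, length, false);
}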
If you are referencing the same substring multiple times, it may well be worth pulling everything out once and dumping the giant string. You will incur some overhead from whatever structure stores all those substrings.
If it's generally a one-off access, then just Substring it; otherwise consider partitioning it up front. Perhaps System.Data.DataTable would be of use? If you're doing multiple accesses and parsing to other data types, then DataTable looks more attractive to me. If you only need one record in memory at a time, then a Dictionary<string, object> should be sufficient to hold one record (field names have to be unique).
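As a one-pass illustration of pulling everything out and then discarding the big string (the field names, offsets, and lengths here are hypothetical):

using System;
using System.Collections.Generic;

static Dictionary<string, object> ExtractFields(string record)
{
    // Extract every field exactly once, so the giant source
    // string can be discarded afterwards.
    var fields = new Dictionary<string, object>();
    fields["Name"] = record.Substring(0, 20).Trim();
    fields["Age"] = int.Parse(record.Substring(20, 3));
    fields["Balance"] = double.Parse(record.Substring(23, 10));
    return fields;
}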
Alternatively, you could write a custom, generic class that handles fixed-length record reading for you. Indicate the start index of each field and the type of the field. The length of each field is inferred from the start of the next field (the exception is the last field, whose length is inferred from the total record length). The types can be auto-converted using the likes of int.Parse(), double.Parse(), bool.Parse(), etc.
RecordParser r = new RecordParser();
r.AddField("Name", 0, typeof(string));
r.AddField("Age", 48, typeof(int));
r.AddField("SystemId", 58, typeof(Guid));
r.RecordLength(80);
Dictionary<string, object> data = r.Parse(recordString);
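A minimal sketch of what such a parser might look like. The class and method names come from the example above; the internals, including the Trim() of space padding and the assumption that fields are added in ascending offset order, are my own choices:

using System;
using System.Collections.Generic;

public class RecordParser
{
    private class Field { public string Name; public int Start; public Type Type; }

    private readonly List<Field> _fields = new List<Field>();
    private int _recordLength;

    // Fields must be added in ascending offset order, since each
    // field's length is inferred from the start of the next field.
    public void AddField(string name, int start, Type type)
    {
        _fields.Add(new Field { Name = name, Start = start, Type = type });
    }

    public void RecordLength(int length)
    {
        _recordLength = length;
    }

    public Dictionary<string, object> Parse(string record)
    {
        var data = new Dictionary<string, object>();
        for (int i = 0; i < _fields.Count; i++)
        {
            // The last field runs to the end of the record.
            int end = (i + 1 < _fields.Count) ? _fields[i + 1].Start : _recordLength;
            string raw = record.Substring(_fields[i].Start, end - _fields[i].Start);
            data[_fields[i].Name] = ConvertField(raw.Trim(), _fields[i].Type);
        }
        return data;
    }

    private static object ConvertField(string raw, Type type)
    {
        if (type == typeof(int)) return int.Parse(raw);
        if (type == typeof(double)) return double.Parse(raw);
        if (type == typeof(bool)) return bool.Parse(raw);
        if (type == typeof(Guid)) return new Guid(raw);
        return raw; // strings (and anything unrecognized) pass through
    }
}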
If reflection suits your fancy:
[RecordLength(80)]
public class MyRecord
{
    [RecordFieldOffset(0)]
    public string Name { get; set; }

    [RecordFieldOffset(48)]
    public int Age { get; set; }

    [RecordFieldOffset(58)]
    public Guid SystemId { get; set; }
}
Simply run through the properties, using PropertyInfo.PropertyType to decide how to convert each substring from the record; pull the offsets and total length out of the attributes; and return an instance of your class with the data populated. Essentially, reflection gathers the information needed to call RecordParser.AddField() and RecordLength() from my previous suggestion.
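The attributes themselves are trivial, and the generic parser can gather its configuration once in the constructor. Here is a sketch building on the RecordParser above; the attribute names match the example, the record class's members are assumed to be public properties, and everything else is an assumption:

using System;
using System.Collections.Generic;
using System.Reflection;

[AttributeUsage(AttributeTargets.Class)]
public class RecordLengthAttribute : Attribute
{
    public readonly int Length;
    public RecordLengthAttribute(int length) { Length = length; }
}

[AttributeUsage(AttributeTargets.Property)]
public class RecordFieldOffsetAttribute : Attribute
{
    public readonly int Offset;
    public RecordFieldOffsetAttribute(int offset) { Offset = offset; }
}

public class RecordParser<T> where T : new()
{
    private readonly RecordParser _inner = new RecordParser();

    public RecordParser()
    {
        var lengthAttr = (RecordLengthAttribute)Attribute.GetCustomAttribute(
            typeof(T), typeof(RecordLengthAttribute));
        _inner.RecordLength(lengthAttr.Length);

        // Sort by offset so field lengths can be inferred correctly,
        // since reflection does not guarantee property order.
        var props = new SortedList<int, PropertyInfo>();
        foreach (PropertyInfo p in typeof(T).GetProperties())
        {
            var offsetAttr = (RecordFieldOffsetAttribute)Attribute.GetCustomAttribute(
                p, typeof(RecordFieldOffsetAttribute));
            if (offsetAttr != null)
                props.Add(offsetAttr.Offset, p);
        }
        foreach (KeyValuePair<int, PropertyInfo> kv in props)
            _inner.AddField(kv.Value.Name, kv.Key, kv.Value.PropertyType);
    }

    public T Parse(string record)
    {
        T instance = new T();
        foreach (KeyValuePair<string, object> pair in _inner.Parse(record))
            typeof(T).GetProperty(pair.Key).SetValue(instance, pair.Value, null);
        return instance;
    }
}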
Then wrap it all up into a neat little no-fuss class:
RecordParser<MyRecord> r = new RecordParser<MyRecord>();
MyRecord data = r.Parse(recordString);
You could even go so far as to add an r.EnumerateFile("path\to\file") method and use the yield return enumeration syntax to parse out records one at a time:
RecordParser<MyRecord> r = new RecordParser<MyRecord>();
foreach (MyRecord data in r.EnumerateFile("foo.dat"))
{
// Do stuff with record
}
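A sketch of what EnumerateFile might look like on the generic parser, assuming one fixed-length record per line (a pure undelimited stream would need to read record-length-sized chunks instead); it requires using System.IO:

public IEnumerable<T> EnumerateFile(string path)
{
    using (StreamReader reader = new StreamReader(path))
    {
        string line;
        // yield return keeps only the current record in memory and
        // resumes here on the next iteration of the caller's loop.
        while ((line = reader.ReadLine()) != null)
            yield return Parse(line);
    }
}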
The fastest method will likely be a stream-based technique: assuming you can read each field sequentially, it keeps only what you need in memory and tracks where you are in the record as you go.
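As an illustration, a sequential pass over the original UTF-8 byte array might look like the sketch below. The offsets are the hypothetical ones from the earlier example, and it assumes ASCII-range data so that byte offsets line up with character offsets:

using System;
using System.Text;

static void ReadFields(byte[] record)
{
    // Decode each field straight from the byte array, in order,
    // without materializing the full record as one big string.
    string name = Encoding.UTF8.GetString(record, 0, 48).Trim();
    int age = int.Parse(Encoding.UTF8.GetString(record, 48, 10));
    string systemId = Encoding.UTF8.GetString(record, 58, 22).Trim();
    // ... use the fields ...
}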