 

Fastest, Efficient, Elegant way of Parsing Strings to Dynamic types?

I'm looking for the fastest generic approach to converting strings into various data types on the fly.

I am parsing large text data files (several megabytes in size). This particular function reads lines from the text file, parses each line into columns based on delimiters, and places the parsed values into a .NET DataTable, which is later inserted into a database. My bottleneck by FAR is the string conversions (Convert and TypeConverter).

I have to go with a dynamic approach (i.e. staying away from "Convert.ToInt32" etc.) because I never know what types are going to be in the files. The types are determined by configuration earlier at runtime.

So far I have tried the following, and both take several minutes to parse a file. Note that if I comment out the conversion line, the rest runs in only a few hundred milliseconds.

row[i] = Convert.ChangeType(columnString, dataType);

AND

TypeConverter typeConverter = TypeDescriptor.GetConverter(type);
row[i] = typeConverter.ConvertFromString(null, cultureInfo, columnString);
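
For context, here is roughly the loop those conversions run in (a simplified sketch; the delimiter, the per-column types, and the DataTable all come from the earlier runtime configuration, so the names below are placeholders):

// Simplified sketch of the parsing loop. Requires System, System.Data, System.IO.
// delimiter, dataTypes and table stand in for values set up by the runtime configuration.
char delimiter = '\t';
Type[] dataTypes = { typeof(int), typeof(double), typeof(string) };

DataTable table = new DataTable();
foreach (Type t in dataTypes)
  table.Columns.Add(new DataColumn { DataType = t });

foreach (string line in File.ReadLines("data.txt"))
{
  string[] columns = line.Split(delimiter);
  DataRow row = table.NewRow();
  for (int i = 0; i < columns.Length; i++)
  {
    // This single conversion line is the bottleneck.
    row[i] = Convert.ChangeType(columns[i], dataTypes[i]);
  }
  table.Rows.Add(row);
}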

If anyone knows of a faster way that is generic like this, I would like to know about it. Or if my whole approach just sucks for some reason, I'm open to suggestions. But please don't point me to non-generic approaches using hard-coded types; that is simply not an option here.

UPDATE - Multi-threading to Improve Performance Test

In order to improve performance I have looked into splitting the parsing work across multiple threads. I found that the speed increased somewhat, but not as much as I had hoped. Here are my results for those who are interested.

System:

Intel Xeon E3-1245 3.3 GHz quad core

Memory: 12.0 GB

Windows 7 Enterprise x64

Test:

The test function is this:

(1) Receive an array of strings. (2) Split the string by delimiters. (3) Parse strings into data types and store them in a row. (4) Add the row to the data table. (5) Repeat (2)-(4) until finished.

The test included 1000 strings, each string being parsed into 16 columns, so that is 16000 string conversions total. I tested single thread, 4 threads (because of quad core), and 8 threads (because of hyper-threading). Since I'm only crunching data here I doubt adding more threads than this would do any good. So for the single thread it parses 1000 strings, 4 threads parse 250 strings each, and 8 threads parse 125 strings each. Also I tested a few different ways of using threads: thread creation, thread pool, tasks, and function objects.
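
For reference, the 8-thread Parameterized Thread Start variant was set up roughly like this (a sketch; ParseInParallel and parseLines are illustrative names standing in for the test body described above, and each thread should fill its own DataTable or synchronize access, since DataTable is not thread-safe):

// Sketch of splitting the 1000 test lines across 8 threads with ParameterizedThreadStart.
// Requires System, System.Linq, System.Threading.
static void ParseInParallel(string[] allLines, Action<string[]> parseLines)
{
  const int threadCount = 8;
  int chunkSize = allLines.Length / threadCount;  // 1000 / 8 = 125 lines per thread

  var threads = new Thread[threadCount];
  for (int t = 0; t < threadCount; t++)
  {
    string[] chunk = allLines.Skip(t * chunkSize).Take(chunkSize).ToArray();
    threads[t] = new Thread(state => parseLines((string[])state));
    threads[t].Start(chunk);
  }

  // Wait for every chunk to finish before stopping the timer.
  foreach (Thread thread in threads)
    thread.Join();
}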

Results (times are in milliseconds):

Single Thread:

  • Method Call: 17720

4 Threads

  • Parameterized Thread Start: 13836
  • ThreadPool.QueueUserWorkItem: 14075
  • Task.Factory.StartNew: 16798
  • Func BeginInvoke EndInvoke: 16733

8 Threads

  • Parameterized Thread Start: 12591
  • ThreadPool.QueueUserWorkItem: 13832
  • Task.Factory.StartNew: 15877
  • Func BeginInvoke EndInvoke: 16395

As you can see, the fastest is Parameterized Thread Start with 8 threads (the number of my logical cores). However, it does not beat 4 threads by much and only cuts the time by about 29% compared to a single thread. Of course results will vary by machine. Also, I stuck with a

    Dictionary<Type, TypeConverter>

cache for string parsing, since using arrays of type converters did not offer a noticeable performance increase, and one shared converter cache is more maintainable than creating arrays all over the place whenever I need them.
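
The cache itself is nothing fancy; a sketch of its shape (the details here are approximate, since the full class is not shown) looks like this:

// Sketch of the shared TypeConverter cache (approximate; the real Extensions class
// is not shown in full). Requires System, System.Collections.Generic, System.ComponentModel.
// Dictionary is not thread-safe for concurrent writes, so it should be pre-populated
// or locked before the worker threads start.
public static class Extensions
{
  private static readonly Dictionary<Type, TypeConverter> converterCache =
    new Dictionary<Type, TypeConverter>();

  public static TypeConverter GetTypeConverter(Type type)
  {
    TypeConverter converter;
    if (!converterCache.TryGetValue(type, out converter))
    {
      converter = TypeDescriptor.GetConverter(type);
      converterCache[type] = converter;
    }
    return converter;
  }
}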

ANOTHER UPDATE:

Ok so I ran some more tests to see if I could squeeze some more performance out and I found some interesting things. I decided to stick with 8 threads, all started from the Parameterized Thread Start method (which was the fastest of my previous tests). The same test as above was run, just with different parsing algorithms. I noticed that

    Convert.ChangeType and TypeConverter

take about the same amount of time. Type-specific converters like

    int.TryParse

are slightly faster, but not an option for me since my types are dynamic. ricovox had some good advice about exception handling. My data does indeed have invalid entries; some integer columns put a dash '-' for empty numbers, so the type converters blow up on that, meaning every row I parse produces at least one exception. That's 1000 exceptions! Very time consuming.

Btw, this is how I do my conversions with TypeConverter. Extensions is just a static class and GetTypeConverter just returns a cached TypeConverter. If an exception is thrown during the conversion, a default value is used.

public static Object ConvertTo(this String arg, CultureInfo cultureInfo, Type type, Object defaultValue)
{
  Object value;
  TypeConverter typeConverter = Extensions.GetTypeConverter(type);

  try
  {
    // Try converting the string.
    value = typeConverter.ConvertFromString(null, cultureInfo, arg);
  }
  catch
  {
    // If the conversion fails then use the default value.
    value = defaultValue;
  }

  return value;
}
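
A typical call from the row-building loop then looks something like this (dataTypes and defaultValues here are hypothetical per-column arrays built from the runtime configuration):

// Hypothetical usage inside the parsing loop; dataTypes and defaultValues are
// assumed per-column arrays, and columns is the delimited line split into strings.
for (int i = 0; i < columns.Length; i++)
{
  row[i] = columns[i].ConvertTo(cultureInfo, dataTypes[i], defaultValues[i]);
}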

Results:

Same test on 8 threads - parse 1000 lines, 16 columns each, 125 lines per thread.

So I did 3 new things.

1 - Run the test: check for known invalid values before parsing to minimize exceptions, e.g. if(!Char.IsDigit(c)) value = 0; or columnString.Contains('-'), etc.

Runtime: 29ms

2 - Run the test: use custom parsing algorithms that have try/catch blocks.

Runtime: 12424ms

3 - Run the test: use custom parsing algorithms checking for invalid values before parsing to minimize exceptions.

Runtime: 15ms

Wow! As you can see, eliminating the exceptions made a world of difference. I never realized how expensive exceptions really were! If I limit my exceptions to TRULY unknown cases, the parsing algorithm runs three orders of magnitude faster. I'm considering this absolutely solved. I will keep the dynamic type conversion with TypeConverter, since it is only a few milliseconds slower. Checking for known invalid values before converting avoids the exceptions, and that speeds things up incredibly! Thanks to ricovox for pointing that out, which made me test this further. A sketch of the pre-check is below.
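
// Sketch of checking for known-invalid values before converting, so the
// TypeConverter only throws for truly unexpected input. In practice the
// dash check should only be applied to numeric columns.
public static Object ConvertTo(this String arg, CultureInfo cultureInfo, Type type, Object defaultValue)
{
  // Known-invalid markers: empty strings and the dash used for empty numbers.
  if (String.IsNullOrWhiteSpace(arg) || arg == "-")
    return defaultValue;

  TypeConverter typeConverter = Extensions.GetTypeConverter(type);
  try
  {
    return typeConverter.ConvertFromString(null, cultureInfo, arg);
  }
  catch
  {
    // Truly unknown bad data still falls back to the default value.
    return defaultValue;
  }
}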

asked Dec 13 '12 by akagixxer
1 Answer

If you are primarily going to be converting the strings to the native data types (string, int, bool, DateTime, etc.) you could use something like the code below, which caches the TypeCodes and TypeConverters (for non-native types) and uses a fast switch statement to jump straight to the appropriate parsing routine. This should save some time over Convert.ChangeType because the source type (string) is already known and you can call the right parse method directly.

/* Get an array of Types for each of your columns.
 * Open the data file for reading.
 * Create your DataTable and add the columns.
 * (You have already done all of these in your earlier processing.)
 * 
 * Note:    For the sake of generality, I've used an IEnumerable<string> 
 * to represent the lines in the file, although for large files,
 * you would use a FileStream or TextReader etc.
*/      
IList<Type> columnTypes;        //array or list of the Type to use for each column
IEnumerable<string> fileLines;  //the lines to parse from the file.
DataTable table;                //the table you'll add the rows to

int colCount = columnTypes.Count;
var typeCodes = new TypeCode[colCount];
var converters = new TypeConverter[colCount];
//Fill up the typeCodes array with the Type.GetTypeCode() of each column type.
//If the TypeCode is Object, then get a custom converter for that column.
for(int i = 0; i < colCount; i++) {
    typeCodes[i] = Type.GetTypeCode(columnTypes[i]);
    if (typeCodes[i] == TypeCode.Object)
        converters[i] = TypeDescriptor.GetConverter(columnTypes[i]);
}

//Probably faster to build up an array of objects and insert them into the row all at once.
object[] vals = new object[colCount];
object val;
foreach(string line in fileLines) {
    //split the line into columns however you see fit. I'll assume a tab character.
    var columns = line.Split('\t');
    for(int i = 0; i < colCount; i++) {
        switch(typeCodes[i]) {
            case TypeCode.String:
                val = columns[i]; break;
            case TypeCode.Int32:
                val = int.Parse(columns[i]); break;
            case TypeCode.DateTime:
                val = DateTime.Parse(columns[i]); break;
            //...list types that you expect to encounter often.

            //finally, deal with other objects
            case TypeCode.Object:
            default:
                val = converters[i].ConvertFromString(columns[i]);
                break;
        }
        vals[i] = val;
    }
    //Add all values to the row at one time. 
    //This might be faster than adding each column one at a time.
    //There are two ways to do this:
    var row = table.Rows.Add(vals); //create new row on the fly.
    // OR, if you created the row previously:
    //row.ItemArray = vals;
}

There really ISN'T any other way that would be faster, because we're basically just using the raw string-parsing methods defined by the types themselves. You could write your own parsing code for each output type, optimized for the exact formats you'll encounter, but I assume that is overkill for your project. It would probably be better and faster to simply tailor the FormatProvider or NumberStyles in each case.

For example, let's say that whenever you parse double values you know, based on your proprietary file format, that you won't encounter any strings containing exponents, and that there won't be any leading or trailing whitespace. You can clue the parser in to these things with the NumberStyles argument as follows:

//NOTE:   using System.Globalization;
var styles = NumberStyles.AllowDecimalPoint | NumberStyles.AllowLeadingSign;
var d = double.Parse(text, styles);

I don't know for a fact how the parsing is implemented, but I would think that the NumberStyles argument allows the parsing routine to work faster by excluding various formatting possibilities. Of course, if you can't make any assumptions about the format of the data, then you won't be able to make these types of optimizations.

Of course, there's always the possibility that your code is slow simply because it takes time to parse a string into a certain data type. Use a performance analyzer (like the one in VS2010) to see where your actual bottleneck is. Then you'll be able to optimize better, or simply give up, e.g. in the case that there is nothing else to do short of writing the parsing routines in assembly :-)

answered by drwatsoncode