Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast string parsing in C#

Tags:

c#

parsing

What's the fastest way to parse strings in C#?

Currently I'm just using string indexing (string[index]) and the code runs reasonably, but I can't help but think that the continuous range checking that the index accessor does must be adding something.

So, I'm wondering what techniques I should consider to give it a boost. These are my initial thoughts/questions:

  1. Use methods like string.IndexOf() and IndexOfAny() to find characters of interest. Are these faster than manually scanning a string by string[index]?
  2. Use regex's. Personally, I don't like regex as I find them difficult to maintain, but are these likely to be faster than manually scanning the string?
  3. Use unsafe code and pointers. This would eliminate the index range checking but I've read that unsafe code wont run in untrusted environments. What exactly are the implications of this? Does this mean the whole assembly won't load/run, or will only the code marked unsafe refuse to run? The library could potentially be used in a number of environments, so to be able to fall back to a slower but more compatible mode would be nice.
  4. What else might I consider?

NB: I should say, the strings I'm parsing could be reasonably large (say 30k) and in a custom format for which there is no standard .NET parser. Also, performance of this code is not super critical, so this partly just a theoretical question of curiosity.

like image 819
Brad Robinson Avatar asked Jul 01 '10 12:07

Brad Robinson


2 Answers

30k is not what I would consider to be large. Before getting excited, I would profile. The indexer should be fine for the best balance of flexibility and safety.

For example, to create a 128k string (and a separate array of the same size), fill it with junk (including the time to handle Random) and sum all the character code-points via the indexer takes... 3ms:

        var watch = Stopwatch.StartNew();
        char[] chars = new char[128 * 1024];
        Random rand = new Random(); // fill with junk
        for (int i = 0; i < chars.Length; i++) chars[i] =
             (char) ((int) 'a' + rand.Next(26));

        int sum = 0;
        string s = new string(chars);
        int len = s.Length;
        for(int i = 0 ; i < len ; i++)
        {
            sum += (int) chars[i];
        }
        watch.Stop();
        Console.WriteLine(sum);
        Console.WriteLine(watch.ElapsedMilliseconds + "ms");
        Console.ReadLine();

For files that are actually large, a reader approach should be used - StreamReader etc.

like image 70
Marc Gravell Avatar answered Oct 18 '22 15:10

Marc Gravell


"Parsing" is quite an inexact term. Since you talks of 30k, it seems that you might be dealing with some sort of structured string which can be covered by creating a parser using a parser generator tool.

A nice tool to create, maintain and understand the whole process is the GOLD Parsing System by Devin Cook: http://www.devincook.com/goldparser/

This can help you create code which is efficient and correct for many textual parsing needs.

As for your points:

  1. is usually not useful for parsing which goes further than splitting a string.

  2. is better suited if there are no recursions or too complex rules.

  3. is basically a no-go if you haven't really identified this as a serious problem. The JIT can take care of doing the range checks only when needed, and indeed for simple loops (the typical for loop) this is handled pretty well.

like image 39
Lucero Avatar answered Oct 18 '22 16:10

Lucero