Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to make sure string does not contain specific characters

I want to make sure that C# string does not contain specific characters.

I am using string.IndexOfAny(char[]), in my opinion Regex would be slower in this task. Is there a better way to accomplish this? Speed is critical in my application.

like image 463
EOG Avatar asked Mar 22 '23 06:03

EOG


1 Answers

Ran a quick benchmark on IndexOf vs IndexOfAny vs Regex vs Hashset.
500 word lorem ipsum haystack, with two character needle.
Tested with both needles in haystack, one in haystack, neither in haystack.

    private long TestIndexOf(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOfAny(needles);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestRegex(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        Regex regex = new Regex(string.Join("|", needles));
        for (int i = 0; i < 1000000; i++)
        {
            Match m = regex.Match(haystack);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestIndexOf(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOf(needles[0]);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    private long TestHashset(string haystack, char[] needles)
    {
        HashSet<char> specificChars = new HashSet<char>(needles.ToList());
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            bool notContainsSpecificChars = !haystack.Any(specificChars.Contains);
        }
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

Result for 1,000,000 iterations:

Index of : 28/2718/2711
Index of Any : 153/141/17561
Regex of : 1068/1102/92324
Hashset : 939/891/111702

Notes:

  • Smaller haystack improves performance.
  • Larger needle set increases regex performance.
  • Larger needle set reduces indexofany performance.
  • If needle is not in haystack all methods degrade in performance

Overall, regex is slower then indexofany by upto 10x depending on the haystack and needle sizes.

like image 78
Kami Avatar answered Apr 06 '23 17:04

Kami