Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast string comparison with list

Tags:

string

c#

list

I need a fast method to determine if a given string is in a list of strings.

The list of strings is not known until runtime but thereafter it will not change.

I could simply have a List<String> called strings and then do:

if (strings.Contains(item))

However this will perform poorly if there are many strings in the list.

I could also use a HashSet<String>, but this would necessitate calling GetHashCode on every incoming string as well as Equals, which would be a waste if there are e.g. only 3 strings in the list. Did I mention this needs to be fast?

I could when setting up, decide to use a List or a HashSet depending on the number of strings (e.g. use List for less than 10 strings, HashSet otherwise), rather like the logic in HybridDictionary.

As the strings are unicode, a standard Trie structure will not work, although a Radix tree/Patricia trie might. Are there any good C# implementations out there with benchmarks?

Some have mentioned bypassing String's GetHashCode and using a faster performing hash function. Are there any benchmarks out there?

Using LINQ expressions to essentially create an optimised switch statement is a novel approach which looks very interesting.

What else would work? Setup cost is not important, just the search speed.

If it matters, the incoming string values will rarely appear in the list.

like image 906
Matt Howells Avatar asked Jul 20 '09 12:07

Matt Howells


4 Answers

You could use a trie to hold the list of strings; tries were designed for fast retrieval. Here's one example of implementing a trie in C#.

Update: Powerpoint presentation on folded tries for Unicode and Ifo on implementation of a folded trie for Unicode (not C#)

like image 146
Vinay Sajip Avatar answered Sep 27 '22 00:09

Vinay Sajip


Check out these:

Jomo Fisher - Fast Switching with LINQ

Gustavo Guerra - StaticStringDictionary - Fast Switching with LINQ revisited

like image 38
Jonathan Rupp Avatar answered Sep 24 '22 00:09

Jonathan Rupp


Have you considered using the HashSet class (in .NET 3) instead?

like image 38
Dan Diplo Avatar answered Sep 24 '22 00:09

Dan Diplo


Re your "when the list is small" concern; if you don't mind using non-generic collections, System.Collections.Specialized.HybridDictionary does something like this; it encapsulates a System.Collections.Specialized.ListDictionary when small, or a System.Collections.Hashtable when it gets bigger (>10). Worth a look?


Otherwise; you could perhaps use HashSet<T> with a custom comparer? Then you can choose how expensive GetHashCode() is...

using System;
using System.Collections.Generic;

class CustomStringComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) {
        return string.Equals(x, y);
    }
    public int GetHashCode(string s) {
        return string.IsNullOrEmpty(s) ? 0 :
            s.Length + 273133 * (int)s[0];
    }
    private CustomStringComparer() { }
    public static readonly CustomStringComparer Default
        = new CustomStringComparer();
}
static class Program {
    static void Main() {
        HashSet<string> set = new HashSet<string>(
            new string[] { "abc", "def", "ghi" }, CustomStringComparer.Default);
        Console.WriteLine(set.Contains("abc"));
        Console.WriteLine(set.Contains("abcde"));
    }
}
like image 23
Marc Gravell Avatar answered Sep 25 '22 00:09

Marc Gravell