Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

IList<T> version of String.IndexOf (finds a sub-'string', not just a single object)

Tags:

c#

.net

algorithm

I'm looking for an implementation of List<T>.IndexOf(List<T>). I've only found List<<T>.IndexOf(T) in the .NET class library.

I have a List longList and a List possibleSubList. I'd like to know if possibleSubList can be found as a sub-'string' within longList, and if so, the index into longList.

This is basically the same semantics as System.String.IndexOf. Anyone know what to call this or if there's a good implementation of it?

Pseudocode Examples:
{1, 2, 3, 9, 8, 7}.IndexOf({3, 9, 8}) = 2
{1, 2, 3, 9, 8, 7}.IndexOf({1, 2, 3, 9, 8, 7}) = 0
{1, 2, 3, 9, 8, 7}.IndexOf({2, 9}) = -1 (not found)

Clarification: I already have a straightforward implementation of this (two nested for loops), but my lists are rather long, and this is in a performance sensitive area. I'm hoping to find a more efficient implementation than my ~O(m*n).

like image 1000
Seth Avatar asked Feb 23 '23 06:02

Seth


2 Answers

Linear Z-Indexing is probably one of the fastest sublist searching algorithm out there today where the pattern is the same and corpus is dynamic, with a true O(n) complexity (with small alphabets, it performs exceptionally better than you might expect from O(n) as ZIndexing provides plenty of opportunities to skip indexes):

I wrote my implementation in a genetics algorithms class under the guidance of Shaojie Zhang from the University of Central Florida. I've adapted the algorithms to C#, and specifically to use generic IList<T>, if you decide to use it, please give credit. The research for these techniques are available here, and specifically, look at the lecture notes here.

At any rate, I've made the code available here

Look inside of TestZIndexing.cs for examples of how to perform searches (in this case on character sequences, but using generics you should be able to use anything with an equality operator).

The usage is simple:

IEnumerable<int> LinearZIndexer.FindZ<T>(
        IList<T> patternSequence, IList<T> sourceSequence, bool bMatchFirstOnly)
        where T: IComparable;

And, as some DNA is circular, I have a circular variant:

IEnumerable<int> LinearZIndexer.FindZCircular<T>(
        IList<T> patternSequence, IList<T> sourceSequence, bool bMatchFirstOnly)
        where T: IComparable;

Let's do it even faster: Suffix Trees

Alternatively, if you want to get even better performance than O(n), you can get O(m), where m is the size of the pattern list, by using a Suffix Tree. This works when the pattern changes and the corpus stays the same (the opposite of the previous case). Look inside the same library I contributed for TestSuffixTree.cs. The only difference here is that you must build the Suffix Tree ahead of time, so it is definitely for multiple pattern searches against a large corpus, but I provide an O(n) and Space(n) algorithm for building that suffix tree.

The calls are equally simple, and again, can use anything that provides an IComparable:

string strTest = "bananabananaorangebananaorangebananabananabananaban";
string[] strFind = {"banana", "orange", "ban"};

// I use char, but you can use any class or primitive that 
// supports IComparable

var tree = new SuffixTree<char>();
tree.BuildTree(strTest.ToCharArray());
var results = tree.Find(str.ToCharArray());
foreach(var r in results) Console.WriteLine(r);

Enjoy.

like image 132
Michael Hays Avatar answered Apr 30 '23 13:04

Michael Hays


Use the string search algorithm: (psuedo code)

findsubstring(list<T> s, list<T> m){
    for(int i=0; i<s.length;++i)
        for(int j=0; j<m.length;++j)
            if(s[i] != s[j])
                break;
            if(j==m.length-1)
                return i;
    return -1;
}
like image 36
nulvinge Avatar answered Apr 30 '23 11:04

nulvinge