I'm looking for an implementation of List<T>.IndexOf(List<T>)
. I've only found List<<T>.IndexOf(T)
in the .NET class library.
I have a List longList
and a List possibleSubList
. I'd like to know if possibleSubList
can be found as a sub-'string' within longList
, and if so, the index into longList
.
This is basically the same semantics as System.String.IndexOf
. Anyone know what to call this or if there's a good implementation of it?
Pseudocode Examples:
{1, 2, 3, 9, 8, 7}.IndexOf({3, 9, 8}) = 2
{1, 2, 3, 9, 8, 7}.IndexOf({1, 2, 3, 9, 8, 7}) = 0
{1, 2, 3, 9, 8, 7}.IndexOf({2, 9}) = -1 (not found)
Clarification: I already have a straightforward implementation of this (two nested for loops), but my lists are rather long, and this is in a performance sensitive area. I'm hoping to find a more efficient implementation than my ~O(m*n).
Linear Z-Indexing is probably one of the fastest sublist searching algorithm out there today where the pattern is the same and corpus is dynamic, with a true O(n) complexity (with small alphabets, it performs exceptionally better than you might expect from O(n) as ZIndexing provides plenty of opportunities to skip indexes):
I wrote my implementation in a genetics algorithms class under the guidance of Shaojie Zhang from the University of Central Florida. I've adapted the algorithms to C#, and specifically to use generic IList<T>
, if you decide to use it, please give credit. The research for these techniques are available here, and specifically, look at the lecture notes here.
At any rate, I've made the code available here
Look inside of TestZIndexing.cs for examples of how to perform searches (in this case on character sequences, but using generics you should be able to use anything with an equality operator).
The usage is simple:
IEnumerable<int> LinearZIndexer.FindZ<T>(
IList<T> patternSequence, IList<T> sourceSequence, bool bMatchFirstOnly)
where T: IComparable;
And, as some DNA is circular, I have a circular variant:
IEnumerable<int> LinearZIndexer.FindZCircular<T>(
IList<T> patternSequence, IList<T> sourceSequence, bool bMatchFirstOnly)
where T: IComparable;
Let's do it even faster: Suffix Trees
Alternatively, if you want to get even better performance than O(n), you can get O(m), where m is the size of the pattern list, by using a Suffix Tree. This works when the pattern changes and the corpus stays the same (the opposite of the previous case). Look inside the same library I contributed for TestSuffixTree.cs
. The only difference here is that you must build the Suffix Tree ahead of time, so it is definitely for multiple pattern searches against a large corpus, but I provide an O(n) and Space(n) algorithm for building that suffix tree.
The calls are equally simple, and again, can use anything that provides an IComparable:
string strTest = "bananabananaorangebananaorangebananabananabananaban";
string[] strFind = {"banana", "orange", "ban"};
// I use char, but you can use any class or primitive that
// supports IComparable
var tree = new SuffixTree<char>();
tree.BuildTree(strTest.ToCharArray());
var results = tree.Find(str.ToCharArray());
foreach(var r in results) Console.WriteLine(r);
Enjoy.
Use the string search algorithm: (psuedo code)
findsubstring(list<T> s, list<T> m){
for(int i=0; i<s.length;++i)
for(int j=0; j<m.length;++j)
if(s[i] != s[j])
break;
if(j==m.length-1)
return i;
return -1;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With