I am looking for ways to find matching patterns in lists or arrays of strings, specifically in .NET, but algorithms or logic from other languages would be helpful. Say I have 3 arrays (or in this specific case List(Of String)) <pre class="prettyprint"><code>Array1 "Do" "Re" "Mi" "Fa" "So" "La" "Ti" Array2 "Mi" "Fa" "Jim" "Bob" "So" Array3 "Jim" "Bob" "So" "La" "Ti" </code></pre> I want to report on the occurrences of the matches of <pre class="prettyprint"><code>("Mi", "Fa") In Arrays (1,2) ("So") In Arrays (1,2,3) ("Jim", "Bob", "So") in Arrays (2,3) ("So", "La", "Ti") in Arrays (1, 3) </code></pre> ...and any others. I am using this to troubleshoot an issue, not to make a commercial product of it specifically, and would rather not do it by hand (there are 110 lists of about 100-200 items). Are there any algorithms, existing code, or ideas that will help me accomplish finding the results described?

As others have mentioned the function you want is Intersect. If you are using .NET 3.0 consider using LINQ's Intersect function. See the following post for more information Consider using LinqPAD to experiment. www.linqpad.net

How to find similar patterns in lists/arrays of strings

Tags:

string

pattern-matching

I am looking for ways to find matching patterns in lists or arrays of strings, specifically in .NET, but algorithms or logic from other languages would be helpful.

Say I have 3 arrays (or in this specific case List(Of String))

Array1
"Do"
"Re"
"Mi"
"Fa"
"So"
"La"
"Ti"

Array2
"Mi"
"Fa"
"Jim"
"Bob"
"So"

Array3
"Jim"
"Bob"
"So"
"La"
"Ti"

I want to report on the occurrences of the matches of

("Mi", "Fa") In Arrays (1,2)
("So") In Arrays (1,2,3)
("Jim", "Bob", "So") in Arrays (2,3)
("So", "La", "Ti") in Arrays (1, 3)

...and any others.

I am using this to troubleshoot an issue, not to make a commercial product of it specifically, and would rather not do it by hand (there are 110 lists of about 100-200 items).

Are there any algorithms, existing code, or ideas that will help me accomplish finding the results described?

469

asked Jan 27 '09 13:01

StingyJack

3 Answers

The simplest way to code would be to build a Dictionary then loop through each item in each array. For each item do this:

Check if the item is in the dictonary if so add the list to the array. If the item is not in the dictionary add it and the list.

Since as you said this is non-production code performance doesn't matter so this approach should work fine.

104

answered Dec 15 '22 04:12

JoshBerke

Here's a solution using SuffixTree module to locate subsequences:

#!/usr/bin/env python
from SuffixTree  import SubstringDict
from collections import defaultdict
from itertools   import groupby
from operator    import itemgetter
import sys

def main(stdout=sys.stdout):
    """
    >>> import StringIO
    >>> s = StringIO.StringIO()
    >>> main(stdout=s)
    >>> print s.getvalue()
    [['Mi', 'Fa']] In Arrays (1, 2)
    [['So', 'La', 'Ti']] In Arrays (1, 3)
    [['Jim', 'Bob', 'So']] In Arrays (2, 3)
    [['So']] In Arrays (1, 2, 3)
    <BLANKLINE>
    """
    # array of arrays of strings
    arr = [
        ["Do", "Re", "Mi", "Fa", "So", "La", "Ti",],
        ["Mi", "Fa", "Jim", "Bob", "So",],
        ["Jim", "Bob", "So", "La", "Ti",],
    ]

####    # 28 seconds  (27 seconds without lesser substrs inspection (see below))
####    N, M = 100, 100
####    import random
####    arr = [[random.randrange(100) for _ in range(M)] for _ in range(N)]

    # convert to ASCII alphabet (for SubstringDict)
    letter2item = {}
    item2letter = {}
    c = 1
    for item in (i for a in arr for i in a):
        if item not in item2letter:
            c += 1
            if c == 128:
                raise ValueError("too many unique items; "
                                 "use a less restrictive alphabet for SuffixTree")
            letter = chr(c)
            letter2item[letter] = item
            item2letter[item] = letter
    arr_ascii = [''.join(item2letter[item] for item in a) for a in arr]

    # populate substring dict (based on SuffixTree)
    substring_dict = SubstringDict()
    for i, s in enumerate(arr_ascii):
        substring_dict[s] = i+1

    # enumerate all substrings, save those that occur more than once
    substr2indices = {}
    indices2substr = defaultdict(list)
    for str_ in arr_ascii:
        for start in range(len(str_)):
            for size in reversed(range(1, len(str_) - start + 1)):
                substr = str_[start:start + size]
                if substr not in substr2indices:
                    indices = substring_dict[substr] # O(n) SuffixTree
                    if len(indices) > 1:
                        substr2indices[substr] = indices
                        indices2substr[tuple(indices)].append(substr)
####                        # inspect all lesser substrs
####                        # it could diminish size of indices2substr[ind] list
####                        # but it has no effect for input 100x100x100 (see above)
####                        for i in reversed(range(len(substr))):
####                            s = substr[:i]
####                            if s in substr2indices: continue
####                            ind = substring_dict[s]
####                            if len(ind) > len(indices):
####                                substr2indices[s] = ind
####                                indices2substr[tuple(ind)].append(s)
####                                indices = ind
####                            else:
####                                assert set(ind) == set(indices), (ind, indices)
####                                substr2indices[s] = None
####                        break # all sizes inspected, move to next `start`

    for indices, substrs in indices2substr.iteritems():
        # remove substrs that are substrs of other substrs
        substrs = sorted(substrs, key=len) # sort by size
        substrs = [p for i, p in enumerate(substrs)
                   if not any(p in q  for q in substrs[i+1:])]
        # convert letters to items and print
        items = [map(letter2item.get, substr) for substr in substrs]
        print >>stdout, "%s In Arrays %s" % (items, indices)

if __name__=="__main__":
    # test
    import doctest; doctest.testmod()
    # measure performance
    import timeit
    t = timeit.Timer(stmt='main(stdout=s)',
                     setup='from __main__ import main; from cStringIO import StringIO as S; s = S()')
    N = 1000
    milliseconds = min(t.repeat(repeat=3, number=N))
    print("%.3g milliseconds" % (1e3*milliseconds/N))

It takes about 30 seconds to process 100 lists of 100 items each. SubstringDict in the above code might be emulated by grep -F -f.

Old solution:

In Python (save it to 'group_patterns.py' file):

#!/usr/bin/env python
from collections import defaultdict
from itertools   import groupby

def issubseq(p, q):
    """Return whether `p` is a subsequence of `q`."""
    return any(p == q[i:i + len(p)] for i in range(len(q) - len(p) + 1))

arr = (("Do", "Re", "Mi", "Fa", "So", "La", "Ti",),
       ("Mi", "Fa", "Jim", "Bob", "So",),
       ("Jim", "Bob", "So", "La", "Ti",))

# store all patterns that occure at least twice
d = defaultdict(list) # a map: pattern -> indexes of arrays it's within
for i, a in enumerate(arr[:-1]):
    for j, q in enumerate(arr[i+1:]): 
        for k in range(len(a)):
            for size in range(1, len(a)+1-k):
                p = a[k:k + size] # a pattern
                if issubseq(p, q): # `p` occures at least twice
                    d[p] += [i+1, i+2+j]

# group patterns by arrays they are within
inarrays = lambda pair: sorted(set(pair[1]))
for key, group in groupby(sorted(d.iteritems(), key=inarrays), key=inarrays):
    patterns = sorted((pair[0] for pair in group), key=len) # sort by size
    # remove patterns that are subsequences of other patterns
    patterns = [p for i, p in enumerate(patterns)
                if not any(issubseq(p, q)  for q in patterns[i+1:])]
    print "%s In Arrays %s" % (patterns, key)

The following command:

$ python group_patterns.py

prints:

[('Mi', 'Fa')] In Arrays [1, 2]
[('So',)] In Arrays [1, 2, 3]
[('So', 'La', 'Ti')] In Arrays [1, 3]
[('Jim', 'Bob', 'So')] In Arrays [2, 3]

The solution is terribly inefficient.

answered Dec 15 '22 05:12

9 revs

As others have mentioned the function you want is Intersect. If you are using .NET 3.0 consider using LINQ's Intersect function.

See the following post for more information

Consider using LinqPAD to experiment.

www.linqpad.net

answered Dec 15 '22 03:12

Mark Lindell

Related questions
                            
                                C++ Incorrect behavior of std::string initialization
                            
                                In Swift 2.0 what is the maximum length of a string?
                            
                                How to Convert string("1.0000") to int
                            
                                Convert a String to binary in swift?
                            
                                How to do Byte Pair Encoding bigram counting and replacements efficiently in Python?
                            
                                How to custom sort an alphanumeric list?
                            
                                Performing a Regex search and Replace on a std::string
                            
                                Splitting string by delimiter in R [duplicate]
                            
                                Matlab repr function
                            
                                Javascript .match regular expression with reverse quantifier (or parse right to left)
                            
                                Replace strings using List Comprehensions
                            
                                Split and re-concatenate a string
                            
                                How to check parentheses validation [duplicate]
                            
                                Dynamic way to hide divs vanilla javascript
                            
                                Convert Stream to String in Java
                            
                                Parse HTML String to DOM and convert it back to string
                            
                                Split string with repeated delimiters
                            
                                Find the difference between strings for each two rows of pandas data.frame
                            
                                Why does str primarily exist in it's borrowed form? [duplicate]
                            
                                "The argument type 'String?' can't be assigned to the parameter type 'String'" when using stdin.readLineSync()

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With