Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter a Set for Matching String Permutations

I am trying to use itertools.permutations() to return all the permutations of the string and return only the ones which are members of a set of words.

import itertools

def permutations_in_dict(string, words): 
    '''
    Parameters
    ----------
    string : {str}
    words : {set}

    Returns
    -------
    list : {list} of {str}    

    Example
    -------
    >>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
    ['act', 'cat']
    '''

My current solution works fine in terminal but somehow couldn't pass the test case...

return list(set([''.join(p) for p in itertools.permutations(string)]) & words)

Any help will be appreciated.

like image 645
Meruemu Avatar asked Jul 01 '17 06:07

Meruemu


Video Answer


2 Answers

Problem Category

The problem you're solving is best described as testing for anagram matches.

Solution using Sort

The traditional solution is to sort the target string, sort the candidate string, and test for equality.

>>> def permutations_in_dict(string, words):
        target = sorted(string)
        return sorted(word for word in words if sorted(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

Solution using Multisets

Another approach is to use collections.Counter() to make a multiset equality test. This is algorithmically superior to the sort solution (O(n) versus O(n log n)) but tends to lose unless the size of the strings is large (due to the cost of hashing all the characters).

>>> def permutations_in_dict(string, words):
        target = Counter(string)
        return sorted(word for word in words if Counter(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

Solution using a Perfect Hash

A unique anagram signature or perfect hash can be constructed by multiplying prime numbers corresponding to each possible character in a string.

The commutative property of multiplication guarantees that the hash value will be invariant for any permutation of a single string. The uniqueness of the hash value is guaranteed by the fundamental theorem of arithmetic (also known as the unique prime factorization theorem).

>>> from operator import mul
>>> primes = [2, 3, 5, 7, 11]
>>> primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
>>> anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
>>> def permutations_in_dict(string, words):
        target = anagram_hash(string)
        return sorted(word for word in words if anagram_hash(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

Solution using Permutations

Searching by permutations on the target string using itertools.permutations() is reasonable when the string is small (generating permutations on a n length string generates n factorial candidates).

The good news is that when n is small and the number of words is large, this approach runs very fast (because set membership testing is O(1)):

>>> from itertools import permutations
>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(word for word in words if word in perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

As the OP surmised, the pure python search loop can be sped-up to c-speed by using set.intersection():

>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(words & perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

Best Solution

Which solution is best depends on the length of string and the length of words. Timings will show which is best for a particular problem.

Here are some comparative timings for the various approaches using two different string sizes:

Timings with string_size=5 and words_size=1000000
-------------------------------------------------
0.01406    match_sort
0.06827    match_multiset
0.02167    match_perfect_hash
0.00224    match_permutations
0.00013    match_permutations_set

Timings with string_size=20 and words_size=1000000
--------------------------------------------------
2.19771    match_sort
8.38644    match_multiset
4.22723    match_perfect_hash
<takes "forever"> match_permutations
<takes "forever"> match_permutations_set

The results indicate that for small strings, the fastest approach searches permutations on the target string using set-intersection.

For larger strings, the fastest approach is the traditional sort-and-compare solution.

Hope you found this little algorithmic study as interesting as I have. The take-aways are:

  • Sets, itertools, and collections make short work of problems like this.
  • Big-oh running times matter (n-factorial disintegrates for large n).
  • Constant overhead matters (sorting beats multisets because of hashing overhead).
  • Discrete mathematics is a treasure trove of ideas.
  • It is hard to know what is best until you do analysis and run timings :-)

Timing Set-up

FWIW, here is a test set-up I used to run the comparative timings:

from collections import Counter
from itertools import permutations
from string import letters
from random import choice
from operator import mul
from time import time

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)

def match_permutations(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(word for word in words if word in perms)

def match_permutations_set(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(words & perms)

string_size = 5
words_size = 1000000

population = letters[: string_size+2]
words = set()
for i in range(words_size):
    word = ''.join([choice(population) for i in range(string_size)])
    words.add(word)
string = word                # Arbitrarily search use the last word as the target

print 'Timings with string_size=%d and words_size=%d' % (string_size, words_size)
for func in (match_sort, match_multiset, match_perfect_hash, match_permutations, match_permutations_set):
    start = time()
    func(string, words)
    end = time()
    print '%-10.5f %s' % (end - start, func.__name__)
like image 96
Raymond Hettinger Avatar answered Oct 25 '22 15:10

Raymond Hettinger


You can simply use collections.Counter() to compare the words to the string without creating all permutations (this explodes with length of string):

from collections import Counter

def permutations_in_dict(string, words):
    c = Counter(string)
    return [w for w in words if c == Counter(w)]

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['cat', 'act']

Note: sets are unordered so if you need a specific order you may need to sort the result, e.g. return sorted(...)

like image 30
AChampion Avatar answered Oct 25 '22 16:10

AChampion