Algorithm: What set of tiles of length N can be used to generate the most Scrabble-valid words?

I'm trying to create a function best_tiles which takes the number of tiles in your hand and returns the set of tiles that allows you to produce the largest number of unique English-valid words, assuming that you can only use each tile once.

For example, with the set of tiles (A, B, C) in your hand you can produce the words CAB, BAC, AB, and BA (all of these are English words), so you can spell 4 unique words with that set. With (B, A, A), you can spell 5 words: ABA, BAA, AA, AB, and BA. The goal is to find the set of letters that allows you to spell the largest number of English-valid words (without replacement).

So if 5 were the maximum number of words that could be spelled with any combination of letters for N = 3, running best_tiles(n = 3) would return (B, A, A).

I'm wondering how to implement this efficiently. My current approach doesn't scale at all well with the number of letters.

I read in a word list. In this case, I'm using enable.txt from here: https://www.wordgamedictionary.com/enable/

import os
path = "enable.txt"
words = []
with open(path, encoding='utf8') as f:
    for values in f:
        words.append(list(values.strip().upper()))

I create a function word_in_tiles (h/t smac89) which returns whether it is possible to construct a word from a given tile set:

import collections

def word_in_tiles(word, tiles):
  tiles_counter = collections.Counter(tiles)
  return all(tiles_counter.get(ch, 0) >= cnt
             for ch, cnt in collections.Counter(word).items())

I then create a function get_all_words which produces a list of all the possible words one can spell from a word list and a tile set.

def get_all_words(tile_set, words):
    # words is a word list
    return [i for i in words if word_in_tiles(i, tile_set)]
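
As a quick sanity check of the (B, A, A) example from above (assuming words has been loaded as shown earlier):

# expect the 5 words from the question's (B, A, A) example: AA, AB, ABA, BA, BAA
matches = get_all_words("BAA", words)
print(sorted("".join(w) for w in matches))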

The extremely naive approach for identifying which tileset is the "best" for three letters is the following:

I first create a list of every possible combination for a given length. So for length 3, I'd do:

import string
import itertools 

letters = string.ascii_lowercase
all_three_letter_combinations = list(itertools.combinations_with_replacement(letters, 3))

# Create a list of only the words that are three letters or fewer
three_letter = [i for i in words if len(i) <= 3]

sequence_dict = dict()
for i in all_three_letter_combinations:
    string_test = "".join(i).upper()
    sequence_dict[i] = get_all_words(string_test, three_letter)

Then I remove the entries whose result is empty and take the tile set whose word list is longest:

res = {k: v for k, v in sequence_dict.items() if len(v) >= 1}

def GetMaxLength(res):
    return max((len(v), v, k) for k, v in res.items())[1:]

Running GetMaxLength(res), you get that, for three letters, the tile set that produces the most English-valid words is (T, A, E), which can produce the following words: ['AE', 'AT', 'ATE', 'EAT', 'ET', 'ETA', 'TA', 'TAE', 'TEA'].
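
For a sense of scale, the number of candidate hands this enumeration has to consider is C(26 + n - 1, n), which grows very quickly (a quick check, just for illustration):

import math

for n in (3, 7, 15):
    # combinations_with_replacement draws C(26 + n - 1, n) candidate hands
    print(n, math.comb(26 + n - 1, n))
# n = 3  ->          3,276 hands
# n = 15 -> 40,225,345,056 hands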

I'd like to be able to scale this up to as big as N = 15. What is the best procedure for doing this?

asked Oct 05 '20 by Parseltongue

4 Answers

This is my previous answer with another idea added, borrowed from David Eisenstat: start with a data structure that squashes out differences caused by common letters. The speedup was quite good.

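To make that squashing concrete, here is a minimal sketch (not part of the solution code below) of the encoding it relies on: each letter occurrence becomes a (letter, occurrence_index) key, so a second 'E' in a word is a different key from the first, and keys can then be ranked by how often they appear across the whole word list.

import collections

def word_to_keys(word):
    # "ELEVEN" -> [('E', 0), ('L', 0), ('E', 1), ('V', 0), ('E', 2), ('N', 0)]
    seen = collections.Counter()
    keys = []
    for letter in word:
        keys.append((letter, seen[letter]))
        seen[letter] += 1
    return keys

print(word_to_keys("ELEVEN"))
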
Here is a full log of the code running under PyPy. Once you get fairly far along, it goes very fast because the data structure has become very small. The slowest case was 23 letters.

0:00:00.000226

0:00:00.000095
ER
0:00:00.000514
EAT
0:00:00.014545
ESAT
0:00:00.069521
ESATL
0:00:00.063586
ESARTP
0:00:00.100273
ESARTOP
0:00:00.309665
ESIARTLP
0:00:01.171279
ESIARNTLP
0:00:02.435968
ESIARNTOLP
0:00:05.049897
ESIARNTOLDP
0:00:11.358067
ESIARNTOLCDP
0:00:23.378628
ESIARNTOLCDUP
0:00:38.672561
ESIARNTOLCDUPM
0:00:57.530691
ESIARNTOLCDUPMG
0:01:23.665592
ESIARNTOLCDUPMGH
0:01:43.571878
ESIARNTOLCDEUPMGH
0:02:02.831546
ESIARNTOLCDEUPMGSH
0:02:13.175045
ESIARNTOLCDEUPMGSHB
0:02:15.644423
ESIARNTOLCDEUPMGSHBY
0:02:25.681542
ESIARNTOLCDEUPMGSHBYK
0:02:38.858906
ESIARNTOLCDEUPMGSHBYAK
0:02:44.703865
ESIARNTOLCDEUPMGSHIBYAT
0:02:52.872105
ESIARNTOLCDEUPMGSHIBYATF
0:02:38.724424
ESIARNTOLCDEUPMGSHIBYATRF
0:02:19.892021
ESIARNTOLCDEUPMGSHIBYATRFV
0:02:01.069136
ESIARNTOLCDEUPMGSHIBYAOTRFV
0:01:39.992728
ESIARNTOLCDEUPMGSHIBYAONTRFV
0:01:09.800219
ESIARNTOLCDEUPMGSHIBYAONTRFVK
0:00:46.791619
ESIARNTOLCDEUPMGSHIBYAONTRFVKW
0:00:30.404072
ESIARNTOLCDEUPMGSHIBYAONTRFVLKW
0:00:17.932808
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWC
0:00:16.610114
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWEC
0:00:13.385111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWECZ
0:00:11.731393
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZ
0:00:09.263752
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZD
0:00:08.459759
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZDP
0:00:06.778232
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZDPX
0:00:05.502753
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPX
0:00:05.504382
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXM
0:00:03.345941
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMU
0:00:02.786702
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUG
0:00:02.258980
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGQ
0:00:02.181486
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGQJ
0:00:02.069378
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGOHQ
0:00:01.678280
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGOHQJ
0:00:01.758940
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJ
0:00:01.015172
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJA
0:00:00.782816
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJAB
0:00:00.590803
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJABF
0:00:00.560564
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJABFT
0:00:00.789469
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFT
0:00:00.213274
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTR
0:00:00.142143
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTER
0:00:00.117293
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERL
0:00:00.062241
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLY
0:00:00.044667
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIY
0:00:00.032022
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYK
0:00:00.033368
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKC
0:00:00.028749
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCV
0:00:00.028905
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVW
0:00:00.038960
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVWZ
0:00:00.032474
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZ
0:00:00.031382
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZD
0:00:00.032725
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDG
0:00:00.028246
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGO
0:00:00.026071
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGON
0:00:00.023227
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONM
0:00:00.019831
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMP
0:00:00.015209
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPU
0:00:00.011805
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUE
0:00:00.013564
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEI
0:00:00.011815
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIA
0:00:00.005378
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIAB
0:00:00.013570
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABL
0:00:00.004920
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLH
0:00:00.002972
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHT
0:00:00.002923
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTF
0:00:00.003162
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSF
0:00:00.004121
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFR
0:00:00.004218
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJ
0:00:00.005604
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJC
0:00:00.006316
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCY
0:00:00.012737
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYN
0:00:00.010365
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNO
0:00:00.011111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOI
0:00:00.013791
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIG
0:00:00.016119
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGD
0:00:00.017771
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDX
0:00:00.015684
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXW
0:00:00.014175
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZW
0:00:00.015253
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWU
0:00:00.009111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEU
0:00:00.009990
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUK
0:00:00.013182
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKQ
0:00:00.003639
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKQV
0:00:00.002877
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQV
0:00:00.001968
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVM
0:00:00.001155
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMA
0:00:00.000890
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAP
0:00:00.000592
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPF
0:00:00.000894
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFB
0:00:00.000512
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTB
0:00:00.000452
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBE
0:00:00.000365
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEI
0:00:00.000277
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZ
0:00:00.000368
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZK
0:00:00.000321
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZKL
0:00:00.000275

And here is the code

import re
import os
import datetime
path = "enable.txt"
words = []
with open(path) as f:
    for values in f:
        words.append(values.strip().upper())

key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# keyed by the next position farther along in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This lets us exploit the fact that, say, a second
# 'e' is unlikely, and store that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.words_len = 0
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words_len += 1
        trie.words.append(word)

    def remove (self, pos):
        trie = Trie(self.path)
        trie.words = None
        trie.dict = self.dict.copy()
        trie.words_len = self.words_len
        trie.count_words = self.count_words
        trie.path = self.path
        if pos in trie.dict:
            trie.count_words -= trie.dict[pos].count_words
            trie.dict.pop(pos)
        return trie

    def merge (self, other):
        trie = Trie(self.path)
        trie.words = None
        trie.words_len = self.words_len + other.words_len
        trie.count_words = self.count_words + other.count_words
        trie.path = self.path
        trie.dict = self.dict.copy()
        for pos, subtrie in other.dict.items():
            if pos in trie.dict:
                trie.dict[pos] = trie.dict[pos].merge(subtrie)
            else:
                trie.dict[pos] = subtrie
        return trie

    def __repr__ (self):
        answer = self.path + ":\n  words_len: " + str(self.words_len) + "\n  count_words:" + str(self.count_words) + " \n  dict:\n"
        def indent (string):
            p = re.compile("^", re.M)
            return p.sub("    ", string)
        for pos in sorted(self.dict.keys()):
            subtrie = self.dict[pos]
            answer = answer + indent(str(pos) + ":\n" + subtrie.__repr__())
        return answer

base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

base_tries = [base_trie]
last_trie = base_trie
for i in range(len(KEYS)):
    subtrie = last_trie.dict[i]
    last_trie = last_trie.remove(i).merge(subtrie)
    base_tries.append(last_trie)

#print(base_tries[1].__repr__())

def best_solution (size):
    def solve (subset, pos, best, partial):
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found

        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + include_trie.words_len,
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )

            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    best_count = 0
    best_subset = []
    start_subset = [i for i in range(size)]
    while 0 < len(start_subset):
        trie = base_tries[len(start_subset)].remove(start_subset[-1]+1)
        count, subset = solve(list(start_subset), start_subset[-1]+2, best_count, [(trie.words_len, trie.count_words, trie)])
        if best_count < count:
            best_count = count
            best_subset = subset
        start_subset.pop()
    return ''.join([KEYS[x][0] for x in best_subset])

for i in range(len(KEYS)):
    start = datetime.datetime.now()
    print(best_solution(i))
    print(datetime.datetime.now() - start)
answered Oct 18 '22 by btilly


This code can optimize n=15 in a couple minutes using PyPy on my laptop, finding

10701 acdegilmnoprstu.

The idea is to do branch and bound, where at each node some letters are forced to be included, and others are excluded. We derive an upper bound on the quality of each node by finding an order-preserving map f (preserving the partial order of multiset inclusion) from multisets of letters to a smaller partially ordered space, then counting the number of words we can get where f(word) is included in the best f(tiles). On the smaller space we can brute force the problem using a fast convolution method (reminiscent of FFT).

To find a good space, we greedily remove letters one at a time so as to affect as few words as possible, until the upper bound can be brute forced.
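
As a toy illustration of that sum-over-subsets step (a sketch over a made-up 3-letter alphabet, not the answer's data): if vector[bitmap] starts out as the number of words whose letter set is exactly bitmap, then after the transform vector[bitmap] is the number of words whose letter set is contained in bitmap.

def sum_over_subsets(vector, dimension):
    # in-place subset-sum transform, same shape as the function in the code below
    for i in range(dimension):
        bit = 1 << i
        for bitmap in range(1 << dimension):
            if not (bitmap & bit):
                vector[bitmap | bit] += vector[bitmap]

# alphabet {a, b, c} -> bits 0, 1, 2; words: "a", "ab", "ac", "abc"
v = [0] * 8
for word in ("a", "ab", "ac", "abc"):
    mask = 0
    for ch in word:
        mask |= 1 << "abc".index(ch)
    v[mask] += 1
sum_over_subsets(v, 3)
print(v[0b011])  # words spellable from {a, b}: "a", "ab" -> 2
print(v[0b111])  # all four words -> 4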

import array
import collections
import functools
import heapq


def count_occurrences_of_letters(raw_word):
    occurs = collections.Counter()
    word = []
    for letter in raw_word:
        word.append(letter + str(occurs[letter]))
        occurs[letter] += 1
    return word


def greedy_censorship_order(words):
    hits = collections.defaultdict(set)
    for index, word in enumerate(words):
        for letter in word:
            hits[letter].add(index)
    order = []
    while hits:
        censored_letter = min(hits.keys(), key=lambda letter: len(hits[letter]))
        order.append(censored_letter)
        for index in hits[censored_letter]:
            for letter in words[index]:
                if letter != censored_letter:
                    hits[letter].remove(index)
        del hits[censored_letter]
    return order


def bitmap_from_word(word, alphabet):
    bitmap = 0
    censored = 0
    for letter in word:
        try:
            bitmap |= 1 << alphabet.index(letter)
        except ValueError:
            censored += 1
    return bitmap, censored


def sum_over_subsets(vector, dimension):
    for i in range(dimension):
        bit = 1 << i
        for bitmap in range(1 << dimension):
            if not (bitmap & bit):
                vector[bitmap | bit] += vector[bitmap]


def count_set_bits(n):
    return bin(n).count("1")


@functools.total_ordering
class Node:
    def __init__(self, subset, n, unfiltered_words):
        self.subset = subset
        self.n = n
        self.words = [word for word in unfiltered_words if len(word) <= n]
        self.upper_bound = sum(not word for word in self.words)
        if n == 0:
            return
        order = greedy_censorship_order(self.words)
        if not order:
            self.pivot = None
            return
        self.pivot = order[-1]
        alphabet = order[-(n + 7) :]
        zeros = [0] * (1 << len(alphabet))
        vectors = [array.array("l", zeros) for i in range(n + 1)]
        for word in self.words:
            bitmap, censored = bitmap_from_word(word, alphabet)
            for i in range(censored, n + 1):
                vectors[i][bitmap] += 1
        for vector in vectors:
            sum_over_subsets(vector, len(alphabet))
        self.upper_bound = max(
            vectors[n - count_set_bits(bitmap)][bitmap]
            for bitmap in range(1 << len(alphabet))
            if count_set_bits(bitmap) <= n
        )

    def children(self):
        if self.pivot is None:
            return
        yield Node(
            self.subset, self.n, [word for word in self.words if self.pivot not in word]
        )
        yield Node(
            self.subset | {self.pivot},
            self.n - 1,
            [
                [letter for letter in word if letter != self.pivot]
                for word in self.words
            ],
        )

    def __eq__(self, other):
        return self.upper_bound == other.upper_bound

    def __ne__(self, other):
        return self.upper_bound != other.upper_bound

    def __lt__(self, other):
        return self.upper_bound > other.upper_bound


def solve(n, words):
    heap = [Node(set(), n, words)]
    while True:
        top = heapq.heappop(heap)
        print(top.upper_bound, "".join(sorted(letter[0] for letter in top.subset)))
        if top.n == 0:
            return
        for child in top.children():
            heapq.heappush(heap, child)


def main():
    with open("enable.txt") as file:
        raw_words = file.read().split()
    words = [count_occurrences_of_letters(word) for word in raw_words]
    solve(15, words)


if __name__ == "__main__":
    main()
answered Oct 18 '22 by David Eisenstat


Here's a "dumb" sum-over-subsets approach that accumulates, for each count of distinct letters from 1 to 26, the selection that yields the most words in the file "enable.txt", in under 33 seconds on my laptop. (The 32 seconds is thanks to a speedup by David Eisenstat, who changed my original code, which ran in 6 minutes 45 seconds, to an in-place method.)

Since btilly and David Eisenstat have already done the difficult work of optimising a search that also allows duplicate letters, we know that the information here, up to 16 letters, is useful.

from collections import defaultdict

def as_number(word):
    word = word.lower()
    n = 0
    for c in word:
        m = ord(c) - 97
        if n & (1 << m):
            return 0
        else:
            n |= 1 << m
    return n

def get_letters(n):
    letters = ""
    i = 0
    while n:
        if n & 1:
            letters += chr(97 + i)
        n >>= 1
        i += 1
    return letters

def f(words, N):
    hash = defaultdict(lambda: 0) #[0] * (1 << N)

    for w in words:
        num = as_number(w)
        if num:
            hash[num] += 1 #= -~hash[num]
  
    dp = [hash.get(mask, 0) for mask in range(1 << N)]

    for i in range(N):
        for mask in range(1 << N):
            if not (mask & (1 << i)):
                dp[mask ^ (1 << i)] += dp[mask]

    result = {}

    for i in range(1, 1 << N):
        k = bin(i).count("1")
        if k in result:
            if result[k]["best"] == dp[i]:
                result[k]["best_letters"].append(get_letters(i))
            elif result[k]["best"] < dp[i]:
                result[k]["best"] = dp[i]
                result[k]["best_letters"] = [get_letters(i)]
        elif dp[i]:
            result[k] = {
                "best": dp[i],
                "best_letters": [get_letters(i)]
            }

    return result

import os
import datetime
path = "enable.txt"
words = []
with open(path) as file:
    for values in file:
        words.append(values.strip())

start = datetime.datetime.now()
print(f(words, 26))
print(datetime.datetime.now() - start)

Output:

// ♥ pypy py.py
{
    2: {
        'best': 2,
        'best_letters': ['ab', 'de', 'ah', 'eh', 'al', 'am', 'em', 'an', 'en', 'do', 'ho', 'mo', 'no', 'er', 'is', 'os', 'at', 'it', 'mu', 'nu', 'ow', 'ay', 'oy']
    },
    3: {
        'best': 9,
        'best_letters': ['aet']
    },
    4: {
        'best': 24,
        'best_letters': ['aest']
    },
    5: {
        'best': 66,
        'best_letters': ['aelst']
    },
    6: {
        'best': 150,
        'best_letters': ['aeprst']
    },
    7: {
        'best': 283,
        'best_letters': ['aeoprst']
    },
    8: {
        'best': 543,
        'best_letters': ['aeilprst']
    },
    9: {
        'best': 945,
        'best_letters': ['aeilnprst']
    },
    10: {
        'best': 1590,
        'best_letters': ['aeilnoprst']
    },
    11: {
        'best': 2557,
        'best_letters': ['adeilnoprst']
    },
    12: {
        'best': 3855,
        'best_letters': ['acdeilnoprst']
    },
    13: {
        'best': 5648,
        'best_letters': ['acdeilnoprstu']
    },
    14: {
        'best': 8001,
        'best_letters': ['acdeilmnoprstu']
    },
    15: {
        'best': 10701,
        'best_letters': ['acdegilmnoprstu']
    },
    16: {
        'best': 14060,
        'best_letters': ['acdeghilmnoprstu']
    },
    17: {
        'best': 17225,
        'best_letters': ['abcdeghilmnoprstu']
    },
    18: {
        'best': 20696,
        'best_letters': ['abcdeghilmnoprstuy']
    },
    19: {
        'best': 23723,
        'best_letters': ['abcdeghiklmnoprstuy']
    },
    20: {
        'best': 26542,
        'best_letters': ['abcdefghiklmnoprstuy']
    },
    21: {
        'best': 29501,
        'best_letters': ['abcdefghiklmnoprstuwy']
    },
    22: {
        'best': 31717,
        'best_letters': ['abcdefghiklmnoprstuvwy']
    },
    23: {
        'best': 32675,
        'best_letters': ['abcdefghiklmnoprstuvwyz']
    },
    24: {
        'best': 33548,
        'best_letters': ['abcdefghiklmnoprstuvwxyz']
    },
    25: {
        'best': 34299,
        'best_letters': ['abcdefghijklmnoprstuvwxyz']
    },
    26: {
        'best': 34816,
        'best_letters': ['abcdefghijklmnopqrstuvwxyz']
    }
}
0:00:32.295888
answered Oct 18 '22 by גלעד ברקן


The goal should be to make your algorithm run on the order of O(enable-list size). That is, we don't want to waste time generating words that are not in the list; that way you can ensure your algorithm is efficient.


  1. The first thing you probably should do is to divide the list by word length.

    So take your enable.txt file and divide it into files that contain all 2-letter words, 3-letter words, etc. Maybe call them enable_2.txt, enable_3.txt, etc.

    import collections

    length_map = collections.defaultdict(list)
    with open("enable.txt", "r") as f:
      for word in f:
        word = word.strip()  # drop the trailing newline so the lengths are right
        if word:
          length_map[len(word)].append(word)

    for n in length_map:
      with open(f"enable_{n}.txt", "w+") as f:
          f.write("\n".join(length_map[n]))
    
    

    You don't have to do this, but it really helps to speed up the next tasks

  2. Now, if you have n tiles, you want to read all of the enable_k.txt files that contain 2-letter through n-letter words (k = 2 to n).

    If you didn't do the first part, then just read the enable.txt file you have.

    For each word in the file, you need to check whether your tiles contain all the letters in that word.

    This is a python function you can use to check this:

    import collections

    def word_in_tiles(word, tiles):
      tiles_counter = collections.Counter(tiles)
      return all(tiles_counter.get(ch, 0) >= cnt
                 for ch, cnt in collections.Counter(word).items())
    

    If the function returns True for any word, you can print that word.

This method would be much faster than generating all possible combinations of tiles.

In the worst case, this method would be as fast as if you somehow generated all valid words already in enable.txt and then filtered out the ones that didn't have the correct length.
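
Putting the two steps together, here is a rough sketch (it assumes the enable_k.txt files from step 1 exist; the helper name words_spellable is illustrative, not part of the answer):

import collections

def word_in_tiles(word, tiles):
    tiles_counter = collections.Counter(tiles)
    return all(tiles_counter.get(ch, 0) >= cnt
               for ch, cnt in collections.Counter(word).items())

def words_spellable(tiles):
    # only read the files whose word length can fit in the hand
    spellable = []
    for k in range(2, len(tiles) + 1):
        try:
            with open(f"enable_{k}.txt") as f:
                candidates = [w.strip().upper() for w in f if w.strip()]
        except FileNotFoundError:
            continue
        spellable.extend(w for w in candidates if word_in_tiles(w, tiles))
    return spellable

print(len(words_spellable("TAE")))  # should be 9 for the (T, A, E) example in the question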

answered Oct 18 '22 by smac89