Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to enumerate combinations filtering repeats

I have a list of possible choices: [[1], [2, 4], [4], [5, 6, 2], [5, 3]].

I want to list all combinations, taking maximum one element from each sublist, without repeating elements.

So [1, 2, 4, 5, 3] is a valid option. But [1, 4, 4, 5, 3] is not. I allow not making a choice in any sublist, so [1,4, None,5,3] is valid, as in [1, None, None, None, None], and [None, None, None, None, None].

I can't simply enumerate all combinations then filter out the ones I don't want, since for a large list of possible choices, it would quickly become computationally infeasible (I'm looking at 25^25 maximum combinations in my project).

edit: I would also apply some additional criteria to the results, such as filtering to have no more than a threshold of None choices, or sorting the resultant list of combinations in order of combinations with fewest None choices.

edit: with details of the real-life case: I'd like to apply it to a list of 25 sublists, each of which can have 1-25 elements. Realisitically, each sublist will have max 15 elements, with 2-4 on average.

So the easy solution of list(itertools.product(*choices)) then filtering is out.

I may also wish to add other filter conditions to the list of combinations, so ideally I can filter these upfront.

I've tried building a tree recursively, where e.g. root node has the full list of choices, child node makes the first choice [1], and has an updated list of choices where '1' is removed from all list[1:] choices.

Struggling to implement the recursion though.

Can you help me with any other approaches?

like image 713
AdZinu Avatar asked Dec 14 '22 06:12

AdZinu


2 Answers

Another way to generate all valid outputs with minimal memory usage is to iterate over the elements rather than over the lists. Use a Depth-First search so that you only generate valid outputs from the start. This means that we need to track three things in each level of our DFS: the current element to maybe add, the lists that are already used, and the order that we used those previous lists in.

To help with our search, we preprocess choices by mapping every element to a set of the possible lists it can be in, which create a sort of Dual version of the problem. This also generates the lists in roughly 'maximum non-empty choices first' order.

Since you've specified that the number of unique elements, N, is equal to the number of lists, this approach runs in O(N * |output|) time, and we use generators everywhere to save memory.

import collections
from typing import Dict, Generator, List, Optional, Set

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
element_to_containing_lists: Dict[int, Set[int]] = collections.defaultdict(set)

for i, choice_list in enumerate(choices):
    for x in choice_list:
        element_to_containing_lists[x].add(i)

all_unique_elements = sorted(element_to_containing_lists)


def dfs(used_list_indices: Set[int],
        next_element_index: int,
        used_list_ordered_indices: List[Optional[int]]) -> Generator[List[Optional[int]]]:
    if next_element_index == len(all_unique_elements):
        yield used_list_ordered_indices
    else:
        # If we use the element, find an unused list index
        for possible_list_to_use in element_to_containing_lists[
                                        all_unique_elements[next_element_index]] - used_list_indices:
            yield from dfs(used_list_indices | {possible_list_to_use},
                           next_element_index + 1,
                           used_list_ordered_indices + [possible_list_to_use])

        # If we don't use the element: Add None as a sentinel value
        yield from dfs(used_list_indices,
                       next_element_index + 1,
                       used_list_ordered_indices + [None])


for element_to_used_list in dfs(set(), 0, []):
    list_to_chosen_element = ['N'] * len(choices)
    for x, y in zip(all_unique_elements, element_to_used_list):
        if y is not None:
            list_to_chosen_element[y] = x
    print(*list_to_chosen_element, sep='  ')

First 10 lines of the output:

1  2  4  5  3
1  2  4  6  3
1  2  4  N  3
1  2  N  5  3
1  2  N  6  3
1  2  N  N  3
1  2  4  5  N
1  2  4  6  5
1  2  4  N  5

This can possibly be optimized to run in O(|output|) time by using a bitmask for 'used lists' rather than a set of their indices.

like image 117
kcsquared Avatar answered Dec 20 '22 16:12

kcsquared


Don't turn the result into a list. product is a generator. Use that to your advantage. filter is a generator too. You only hold one combination in memory this way. Sometimes a single output of product will be discarded internally without you seeing it, but it won't take up any extra space.

def criterion(x):
    k = [i for i in x if i is not None]
    return len(k) == len(set(k))

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
for c in choices:
    c.append(None)
filter(criterion, product(*choices))
like image 37
Mad Physicist Avatar answered Dec 20 '22 14:12

Mad Physicist