
Algorithm to find "most common elements" in different arrays

I have, for example, 4 arrays containing some elements (numbers):

1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30

I need to find the most common elements in those arrays, and every element should go all the way till the end (see example below). In this example that would be the combination 4,4,4,2 (or the same one with "30" at the end, which counts as the same) because it contains the smallest number of different elements (only two: 4 and 2/30).

This combination (see below) isn't good because if I have, for example, "4", it must continue until it ends (the array after the group mustn't contain "4" at all). So a combination must go all the way till the end.

1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30

EDIT2: OR

1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30

OR anything else is NOT good.

Is there some algorithm to speed this thing up (if I have thousands of arrays with hundreds of elements in each one)?

To make it clear: the solution must contain the lowest number of different elements, and the groups (of the same numbers) must be ordered from the first, larger ones to the last, smaller ones. So in the example above, 4,4,4,2 is better than 4,2,2,2 because in the first one the group of 4's is larger than the group of 2's.

EDIT: To be more specific. The solution must contain the smallest number of different elements, and those elements must be grouped from first to last. So if I have three arrays like

1,2,3
1,4,5
4,5,6

The solution is 1,1,4 or 1,1,5 or 1,1,6, NOT 2,5,5, because the 1's form a larger group (two of them) than the 2's (only one).
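
One way to read the ordering rule above is as a sort key: fewer groups first, then longer groups earlier. The sketch below is only an interpretation of the question; `rank` is a hypothetical helper name, and it does not check that a candidate can actually be built from the arrays.

```python
from itertools import groupby

def rank(candidate):
    """Sort key: fewer groups first, then longer groups earlier."""
    lengths = [len(list(group)) for _, group in groupby(candidate)]
    # Negate the lengths so that a smaller key means a longer leading group.
    return (len(lengths), [-n for n in lengths])

candidates = [[2, 5, 5], [1, 1, 4], [1, 1, 5]]
best = min(candidates, key=rank)
print(best)  # [1, 1, 4]
```

Under this key, 1,1,4 and 1,1,5 tie, and both rank ahead of 2,5,5.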

Thanks.

EDIT3: I can't be more specific :(

EDIT4: @spintheblack 1,1,1,2,4 is the correct solution because a number used for the first time (let's say at position 1) can't be used later (unless it's in the SAME group of 1's). I would say that grouping has the "priority"? Also, I didn't mention it (sorry about that), but the numbers in the arrays are NOT sorted in any way; I typed them that way in this post because it was easier for me to follow.

svenkapudija asked Feb 18 '11 20:02


4 Answers

Here is the approach you want to take, if arrays is an array that contains each individual array.

  1. Starting at i = 0
  2. current = arrays[i]
  3. Loop i from i+1 to len(arrays)-1
  4. new = current & arrays[i] (set intersection, finds common elements)
  5. If there are any elements in new, do step 6, otherwise skip to 7
  6. current = new, return to step 3 (continue loop)
  7. print or yield an element from current, current = arrays[i], return to step 3 (continue loop)

Here is a Python implementation:

def mce(arrays):
  count = 1
  current = set(arrays[0])
  for i in range(1, len(arrays)):
    new = current & set(arrays[i])
    if new:
      # Still at least one common element: extend the current run.
      count += 1
      current = new
    else:
      # Run ended: emit any surviving element, once per array in the run.
      print(" ".join([str(current.pop())] * count), end=" ")
      count = 1
      current = set(arrays[i])
  print(" ".join([str(current.pop())] * count))

>>> mce([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]])
4 4 4 2

Andrew Clark answered Sep 27 '22 02:09


If all are number lists, and are all sorted, then,

  1. Convert to array of bitmaps.
  2. Keep 'AND'ing the bitmaps till you hit zero. The position of the 1 in the previous value indicates the first element.
  3. Restart step 2 from the next element
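
The steps above can be sketched with Python integers used as bitmaps. This is a minimal sketch, assuming all elements are non-negative integers; `to_bitmap` and `greedy_runs` are hypothetical names, and picking the lowest set bit is just one arbitrary way to read an element out of the last non-zero value.

```python
def to_bitmap(values):
    """Set bit v for every non-negative integer v in values."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << v
    return bitmap

def greedy_runs(arrays):
    """Return [(element, run_length), ...] by ANDing bitmaps until zero."""
    runs = []
    current, count = to_bitmap(arrays[0]), 1
    for array in arrays[1:]:
        merged = current & to_bitmap(array)
        if merged:
            current, count = merged, count + 1
        else:
            # Lowest set bit of the last non-zero value names the element.
            runs.append(((current & -current).bit_length() - 1, count))
            current, count = to_bitmap(array), 1
    runs.append(((current & -current).bit_length() - 1, count))
    return runs

print(greedy_runs([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]]))
# [(4, 3), (2, 1)]
```

Here (4, 3) means element 4 covers three arrays in a row, followed by element 2 for one array.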

CMR answered Sep 24 '22 02:09


This has now turned into a graph problem with a twist.

The problem is a directed acyclic graph of connections between stops, and the goal is to minimize the number of line switches when riding on a train/tram.

i.e., this list of sets:

1,4,8,10           <-- stop A
1,2,3,4,11,15      <-- stop B
2,4,20,21          <-- stop C
2,30               <-- stop D, destination

He needs to pick lines that are available at his exit stop, and his arrival stop, so for instance, he can't pick 10 from stop A, because 10 does not go to stop B.

So, this is the set of available lines and the stops they stop on:

             A     B     C     D
line 1  -----X-----X-----------------
line 2  -----------X-----X-----X-----
line 3  -----------X-----------------
line 4  -----X-----X-----X-----------
line 8  -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----

If we consider that a line under consideration must go between at least 2 consecutive stops, let me highlight the possible choices of lines with equal signs:

             A     B     C     D
line 1  -----X=====X-----------------
line 2  -----------X=====X=====X-----
line 3  -----------X-----------------
line 4  -----X=====X=====X-----------
line 8  -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----

He then needs to pick a way that transports him from A to D, with the minimal number of line switches.

Since he explained that he wants the longest rides first, the following sequence seems the best solution:

  • take line 4 from stop A to stop C, then switch to line 2 from C to D

Code example:

stops = [
    [1, 4, 8, 10],
    [1,2,3,4,11,15],
    [2,4,20,21],
    [2,30],
]

def calculate_possible_exit_lines(stops):
    """
    only return lines that are available at both exit
    and arrival stops, discard the rest.
    """

    result = []
    for index in range(0, len(stops) - 1):
        lines = []
        for value in stops[index]:
            if value in stops[index + 1]:
                lines.append(value)
        result.append(lines)
    return result

def all_combinations(lines):
    """
    produce all combinations which travel from one end
    of the journey to the other, across available lines.
    """

    if not lines:
        yield []
    else:
        for line in lines[0]:
            for rest_combination in all_combinations(lines[1:]):
                yield [line] + rest_combination

def reduce(combination):
    """
    reduce a combination by returning the number of
    times each value appear consecutively, ie.
    [1,1,4,4,3] would return [2,2,1] since
    the 1's appear twice, the 4's appear twice, and
    the 3 only appear once.
    """

    result = []
    while combination:
        count = 1
        value = combination[0]
        combination = combination[1:]
        while combination and combination[0] == value:
            combination = combination[1:]
            count += 1
        result.append(count)
    return tuple(result)

def calculate_best_choice(lines):
    """
    find the best choice by reducing each available
    combination down to the number of stops you can
    sit on a single line before having to switch,
    and then picking the one that has the most stops
    first, and then so on.
    """

    available = []
    for combination in all_combinations(lines):
        count_stops = reduce(combination)
        available.append((count_stops, combination))
    available = sorted(available, reverse=True)
    return available[0][1]

possible_lines = calculate_possible_exit_lines(stops)
print("possible lines: %s" % (str(possible_lines), ))
best_choice = calculate_best_choice(possible_lines)
print("best choice: %s" % (str(best_choice), ))

This code prints:

possible lines: [[1, 4], [2, 4], [2]]
best choice: [4, 4, 2]

Since, as I said, I list lines between stops, each entry in the above solution can be read either as the line you leave a stop on or as the line you arrive on at the next stop.

So the route is:

  • Hop onto line 4 at stop A and ride on that to stop B, then to stop C
  • Hop onto line 2 at stop C and ride on that to stop D

There are probably edge-cases here that the above code doesn't work for.
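
One such edge case appears to be a pair of consecutive stops with no line in common: `all_combinations` then yields nothing, so `available[0]` in `calculate_best_choice` would raise an IndexError. A minimal pre-check could look like this, where `has_through_route` is a hypothetical helper name and not part of the code above:

```python
def has_through_route(stops):
    """True if every pair of consecutive stops shares at least one line."""
    return all(set(a) & set(b) for a, b in zip(stops, stops[1:]))

print(has_through_route([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]]))  # True
print(has_through_route([[1], [2], [1]]))  # False
```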

However, I'm not bothering more with this question. The OP has demonstrated a complete incapability in communicating his question in a clear and concise manner, and I fear that any corrections to the above text and/or code to accommodate the latest comments will only provoke more comments, which leads to yet another version of the question, and so on ad infinitum. The OP has gone to extraordinary lengths to avoid answering direct questions or to explain the problem.

Lasse V. Karlsen answered Sep 25 '22 02:09


I am assuming that "distinct elements" do not actually have to be distinct; they can repeat in the final solution. That is, if presented with [1], [2], [1], the obvious answer [1, 2, 1] is allowed, but we'd count it as having 3 distinct elements.

If so, then here is a Python solution:

def find_best_run(first_array, *argv):
    # initialize data structures.
    this_array_best_run = {}
    for x in first_array:
        this_array_best_run[x] = (1, (1,), (x,))

    for this_array in argv:
        # find the best runs ending at each value in this_array
        last_array_best_run = this_array_best_run
        this_array_best_run = {}

        for x in this_array:
            for (y, pattern) in last_array_best_run.items():
                (distinct_count, lengths, elements) = pattern
                if x == y:
                    # same element: extend the last group by one
                    lengths = lengths[:-1] + (lengths[-1] + 1,)
                else:
                    # different element: start a new group
                    distinct_count += 1
                    lengths = lengths + (1,)
                    elements = elements + (x,)

                if x not in this_array_best_run:
                    this_array_best_run[x] = (distinct_count, lengths, elements)
                else:
                    (prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
                    # fewer distinct elements wins; ties go to the larger lengths
                    if distinct_count < prev_count or (
                            distinct_count == prev_count and prev_lengths < lengths):
                        this_array_best_run[x] = (distinct_count, lengths, elements)

    # find the best overall run
    best_count = len(argv) + 10 # Needs to be bigger than any possible answer.
    for (distinct_count, lengths, elements) in this_array_best_run.values():
        if distinct_count < best_count:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
        elif distinct_count == best_count and best_lengths < lengths:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements

    # convert it into a more normal representation.
    answer = []
    for (length, element) in zip(best_lengths, best_elements):
        answer.extend([element] * length)

    return answer

# example
print(find_best_run(
    [1,4,8,10],
    [1,2,3,4,11,15],
    [2,4,20,21],
    [2,30])) # prints [4, 4, 4, 2]

Here is an explanation. The ..._best_run dictionaries have keys which are elements of the current array, and values which are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this prefers the run with the largest value in the first spot), and we track elements so the answer can be rebuilt at the end. At each step I construct all possible runs that combine a run up to the previous array with this element next in sequence, and keep the best one ending at each element. When I get to the end I pick the best overall run, then turn it into a conventional representation and return it.
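
The preference for "the largest value in the first spot" relies on Python's built-in lexicographic tuple comparison. A small illustration, with made-up run tuples in the same (distinct_count, lengths, elements) shape:

```python
# Tuples compare element by element, left to right.
print((3, 1) > (2, 2))   # True: a first group of 3 beats a first group of 2
print((2, 1) > (1, 2))   # True
print((2,) > (1, 1))     # True

# Fewer distinct elements first, then larger lengths.
# Negating the count lets a single max() express both preferences.
runs = [(2, (3, 1), (4, 2)), (2, (2, 2), (1, 2)), (3, (2, 1, 1), (1, 2, 30))]
best = max(runs, key=lambda run: (-run[0], run[1]))
print(best)  # (2, (3, 1), (4, 2))
```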

If you have N arrays of length M, this should take O(N*M*M) time to run.

btilly answered Sep 26 '22 02:09