
Optimal merging of triplets

I'm trying to come up with an algorithm for the following problem:

I've got a collection of triplets of integers - let's call these integers A, B, C. The stored values can be large, so it's generally impossible to create an array of size A, B, or C. The goal is to minimize the size of the collection. To do this, we're given a simple rule that allows us to merge triplets:

  • For two triplets (A, B, C) and (A', B', C'), if B == B' and C == C', remove the original triplets and add the triplet (A | A', B, C), where | is bitwise OR. Similar rules hold for B and C also.

In other words, if two values of two triplets are equal, remove these two triplets, bitwise OR the third values and place the result to the collection.
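To make the rule concrete, here is a minimal standalone sketch; the long[]{A, B, C} representation and the tryMerge name are mine, not part of the problem statement:

```java
public class MergeRuleDemo {
    // Returns the merged triplet, or null if no two components agree.
    // A triplet is represented as a long[3] = {A, B, C}.
    static long[] tryMerge(long[] r, long[] s) {
        if (r[1] == s[1] && r[2] == s[2]) return new long[]{r[0] | s[0], r[1], r[2]};
        if (r[0] == s[0] && r[2] == s[2]) return new long[]{r[0], r[1] | s[1], r[2]};
        if (r[0] == s[0] && r[1] == s[1]) return new long[]{r[0], r[1], r[2] | s[2]};
        return null;
    }

    public static void main(String[] args) {
        // (5, 2, 3) and (6, 2, 3) agree on B and C, so the A values are OR-ed:
        // 5 | 6 == 0b101 | 0b110 == 0b111 == 7, giving (7, 2, 3).
        long[] merged = tryMerge(new long[]{5, 2, 3}, new long[]{6, 2, 3});
        System.out.println(merged[0] + " " + merged[1] + " " + merged[2]); // 7 2 3
    }
}
```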

The greedy approach is usually misleading in cases like this, and so it is for this problem, but I can't find a simple counterexample that would point toward a correct solution. For a list of 250 items where the correct solution is 14, the average size produced by greedy merging is about 30 (it varies from roughly 20 to 70). The sub-optimal overhead grows as the list size increases.

I've also tried playing around with set bit counts, but found nothing meaningful - just the obvious fact that if the records are unique (which is safe to assume), the set bit count always increases.

Here's the stupid greedy implementation (it's just a conceptual thing, so please disregard the code style):

import java.util.ArrayList;
import java.util.List;

public class Record {
    long A;
    long B;
    long C;

    public static void main(String[] args) {
        List<Record> data = new ArrayList<>();
        // Fill it with some data

        boolean found;

        do {
            found = false;
            outer:
            for (int i = 0; i < data.size(); ++i) {
                for (int j = i+1; j < data.size(); ++j) {
                    try {
                        Record r = merge(data.get(i), data.get(j));
                        found = true;
                        data.remove(j);
                        data.remove(i);
                        data.add(r);
                        break outer;
                    } catch (IllegalArgumentException ignored) {
                    }
                }
            }
        } while (found);
    }

    public static Record merge(Record r1, Record r2) {
        if (r1.A == r2.A && r1.B == r2.B) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B;
            r.C = r1.C | r2.C;
            return r;
        }
        if (r1.A == r2.A && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A;
            r.B = r1.B | r2.B;
            r.C = r1.C;
            return r;
        }
        if (r1.B == r2.B && r1.C == r2.C) {
            Record r = new Record();
            r.A = r1.A | r2.A;
            r.B = r1.B;
            r.C = r1.C;
            return r;
        }
        throw new IllegalArgumentException("Unable to merge these two records!");
    }
}

Do you have any idea how to solve this problem?

Asked by Danstahr on Mar 06 '14.


1 Answer

This is going to be a very long answer, sadly without an optimal solution (sorry). It is however a serious attempt at applying greedy problem solving to your problem, so it may be useful in principle. I didn't implement the last approach discussed, perhaps that approach can yield the optimal solution -- I can't guarantee that though.

Level 0: Not really greedy

By definition, a greedy algorithm uses a heuristic to choose the next step in a way that is locally optimal, i.e. optimal right now, hoping to reach the global optimum, which may or may not always be possible.

Your algorithm chooses any mergable pair, merges it, and moves on. It does not evaluate what the merge implies, or whether there is a better local choice. Because of this I wouldn't call your approach greedy at all; it is just a solution, an approach. I will call it the blind algorithm so that I can refer to it succinctly in my answer.

I will also use a slightly modified version of your algorithm which, instead of removing two triplets and appending the merged triplet, removes only the second triplet and replaces the first one with the merged one. The order of the resulting triplets differs, and thus possibly the final result as well. Let me run this modified algorithm over a representative data set, marking to-be-merged triplets with a *:

0: 3 2 3   3 2 3   3 2 3
1: 0 1 0*  0 1 2   0 1 2
2: 1 2 0   1 2 0*  1 2 1
3: 0 1 2*
4: 1 2 1   1 2 1*
5: 0 2 0   0 2 0   0 2 0

Result: 4
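For reference, the modified blind pass traced above can be sketched as follows; the names are my own, and a triplet is represented as a long[3]:

```java
import java.util.ArrayList;
import java.util.List;

public class BlindReduce {
    // Returns the merged triplet, or null if no two components agree.
    static long[] tryMerge(long[] r, long[] s) {
        if (r[1] == s[1] && r[2] == s[2]) return new long[]{r[0] | s[0], r[1], r[2]};
        if (r[0] == s[0] && r[2] == s[2]) return new long[]{r[0], r[1] | s[1], r[2]};
        if (r[0] == s[0] && r[1] == s[1]) return new long[]{r[0], r[1], r[2] | s[2]};
        return null;
    }

    // Repeatedly merge the first mergable pair found; the merged triplet
    // replaces the first of the pair in place, so the list order stays stable.
    static List<long[]> reduce(List<long[]> data) {
        boolean found;
        do {
            found = false;
            outer:
            for (int i = 0; i < data.size(); ++i) {
                for (int j = i + 1; j < data.size(); ++j) {
                    long[] m = tryMerge(data.get(i), data.get(j));
                    if (m != null) {
                        data.set(i, m);  // replace the first triplet
                        data.remove(j);  // drop the second
                        found = true;
                        break outer;
                    }
                }
            }
        } while (found);
        return data;
    }

    public static void main(String[] args) {
        List<long[]> data = new ArrayList<>(List.of(
                new long[]{3, 2, 3}, new long[]{0, 1, 0}, new long[]{1, 2, 0},
                new long[]{0, 1, 2}, new long[]{1, 2, 1}, new long[]{0, 2, 0}));
        System.out.println(reduce(data).size()); // 4, as in the trace above
    }
}
```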

Level 1: Greedy

To have a greedy algorithm, you need to formulate the merging decision in a way that allows for comparison of options, when multiple are available. For me, the intuitive formulation of the merging decision was:

If I merge these two triplets, will the resulting set have the maximum possible number of mergable triplets, when compared to the result of merging any other two triplets from the current set?

I repeat, this is intuitive for me. I have no proof that this leads to the globally optimal solution, not even that it will lead to a better-or-equal solution than the blind algorithm -- but it fits the definition of greedy (and is very easy to implement). Let's try it on the above data set, showing between each step, the possible merges (by indicating the indices of triplet pairs) and resulting number of mergables for each possible merge:

          mergables
0: 3 2 3  (1,3)->2
1: 0 1 0  (1,5)->1
2: 1 2 0  (2,4)->2
3: 0 1 2  (2,5)->2
4: 1 2 1
5: 0 2 0

Any choice except merging triplets 1 and 5 is fine. If we take the first pair, we get the same interim set as with the blind algorithm (this time I will collapse the indices to remove gaps):

          mergables
0: 3 2 3  (2,3)->0
1: 0 1 2  (2,4)->1
2: 1 2 0
3: 1 2 1
4: 0 2 0

This is where this algorithm chooses differently: it merges triplets 2 and 4, because one more merge is still possible afterwards, in contrast to the choice made by the blind algorithm:

          mergables
0: 3 2 3  (2,3)->0   3 2 3
1: 0 1 2             0 1 2
2: 1 2 0             1 2 1
3: 1 2 1

Result: 3
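The heuristic above can be sketched like this - again with my own names, and as a sketch rather than the answerer's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class GreedyReduce {
    static long[] tryMerge(long[] r, long[] s) {
        if (r[1] == s[1] && r[2] == s[2]) return new long[]{r[0] | s[0], r[1], r[2]};
        if (r[0] == s[0] && r[2] == s[2]) return new long[]{r[0], r[1] | s[1], r[2]};
        if (r[0] == s[0] && r[1] == s[1]) return new long[]{r[0], r[1], r[2] | s[2]};
        return null;
    }

    // Number of pairs that could currently be merged.
    static int countMergable(List<long[]> data) {
        int n = 0;
        for (int i = 0; i < data.size(); ++i)
            for (int j = i + 1; j < data.size(); ++j)
                if (tryMerge(data.get(i), data.get(j)) != null) ++n;
        return n;
    }

    // Perform the one merge whose result leaves the most mergable pairs.
    // Returns false when no merge is possible.
    static boolean greedyStep(List<long[]> data) {
        int bestI = -1, bestJ = -1, bestScore = -1;
        long[] bestMerged = null;
        for (int i = 0; i < data.size(); ++i) {
            for (int j = i + 1; j < data.size(); ++j) {
                long[] m = tryMerge(data.get(i), data.get(j));
                if (m == null) continue;
                List<long[]> copy = new ArrayList<>(data); // simulate the merge
                copy.set(i, m);
                copy.remove(j);
                int score = countMergable(copy);
                if (score > bestScore) {
                    bestScore = score; bestI = i; bestJ = j; bestMerged = m;
                }
            }
        }
        if (bestI < 0) return false;
        data.set(bestI, bestMerged);
        data.remove(bestJ);
        return true;
    }

    public static void main(String[] args) {
        List<long[]> data = new ArrayList<>(List.of(
                new long[]{3, 2, 3}, new long[]{0, 1, 0}, new long[]{1, 2, 0},
                new long[]{0, 1, 2}, new long[]{1, 2, 1}, new long[]{0, 2, 0}));
        while (greedyStep(data)) { }
        System.out.println(data.size()); // 3 on this set, as shown above
    }
}
```

Ties are broken by the first pair found in scan order, so other tie-breaking rules may give different results.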

Level 2: Very greedy

Now, a second step beyond this intuitive heuristic is to look one merge further ahead and ask the heuristic question then. Generalized, you look ahead k merges further, apply the above heuristic, backtrack, and decide on the best option. This gets very verbose, so to illustrate, I will only perform one step of this new heuristic with lookahead 1:

          mergables
0: 3 2 3  (1,3)->(2,3)->0
1: 0 1 0         (2,4)->1*
2: 1 2 0  (1,5)->(2,4)->0
3: 0 1 2  (2,4)->(1,3)->0
4: 1 2 1         (1,4)->0
5: 0 2 0  (2,5)->(1,3)->1*
                 (2,4)->1*

Merge sequences marked with an asterisk are the best options when this new heuristic is applied.

In case a verbal explanation is necessary:

Instead of checking how many merges are possible after each possible merge for the starting set, this time we check how many merges are possible after each possible merge for each resulting set after each possible merge for the starting set. And this is for lookahead 1. For lookahead n, you'd be reading a very long sentence repeating the part "after each possible merge for each resulting set" n times.
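A recursive sketch of this lookahead scoring (my own names; with k = 0 it reproduces the Level-1 heuristic):

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadDemo {
    static long[] tryMerge(long[] r, long[] s) {
        if (r[1] == s[1] && r[2] == s[2]) return new long[]{r[0] | s[0], r[1], r[2]};
        if (r[0] == s[0] && r[2] == s[2]) return new long[]{r[0], r[1] | s[1], r[2]};
        if (r[0] == s[0] && r[1] == s[1]) return new long[]{r[0], r[1], r[2] | s[2]};
        return null;
    }

    static int countMergable(List<long[]> data) {
        int n = 0;
        for (int i = 0; i < data.size(); ++i)
            for (int j = i + 1; j < data.size(); ++j)
                if (tryMerge(data.get(i), data.get(j)) != null) ++n;
        return n;
    }

    // Best mergable count reachable after performing k + 1 merges:
    // k = 0 asks the Level-1 question, larger k looks further ahead.
    static int lookaheadScore(List<long[]> data, int k) {
        int best = -1;
        for (int i = 0; i < data.size(); ++i) {
            for (int j = i + 1; j < data.size(); ++j) {
                long[] m = tryMerge(data.get(i), data.get(j));
                if (m == null) continue;
                List<long[]> copy = new ArrayList<>(data); // simulate the merge
                copy.set(i, m);
                copy.remove(j);
                int s = (k == 0) ? countMergable(copy) : lookaheadScore(copy, k - 1);
                best = Math.max(best, s);
            }
        }
        return Math.max(best, 0); // 0 when no merge is possible at all
    }

    public static void main(String[] args) {
        List<long[]> data = new ArrayList<>(List.of(
                new long[]{3, 2, 3}, new long[]{0, 1, 0}, new long[]{1, 2, 0},
                new long[]{0, 1, 2}, new long[]{1, 2, 1}, new long[]{0, 2, 0}));
        // Lookahead 0 reproduces the Level-1 table's best score of 2;
        // lookahead 1 reproduces the asterisked best score of 1 above.
        System.out.println(lookaheadScore(data, 0) + " " + lookaheadScore(data, 1)); // 2 1
    }
}
```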

Level 3: Let's cut the greed

If you look closely, the previous approach has disastrous performance for even moderate inputs and lookaheads (*). For inputs beyond 20 triplets, anything beyond a 4-merge lookahead takes unreasonably long. The idea here is to cut merge paths that look worse than an existing solution: if we want to perform lookahead 10, and one merge path yields fewer mergables after three merges than another path does after five, we may just as well cut the current merge path and try another one. This should save a lot of time and allow large lookaheads, which would hopefully get us closer to the globally optimal solution. I haven't implemented this one for testing, though.

(*): Assuming a large reduction of the input set is possible, the number of merges is proportional to the input size, and the lookahead roughly says how many of those merges you permute. So you are choosing lookahead merges out of |input| - a binomial coefficient that for lookahead ≪ |input| can be approximated as O(|input|^lookahead), which can also (rightfully) be read as "you are thoroughly screwed".
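For very small inputs, an exhaustive search over all merge sequences gives the ground truth that such pruning would approximate. This is my own brute-force baseline, not the answerer's proposed cut, and it is exponential, so it is usable only as a reference:

```java
import java.util.ArrayList;
import java.util.List;

public class ExhaustiveReduce {
    static long[] tryMerge(long[] r, long[] s) {
        if (r[1] == s[1] && r[2] == s[2]) return new long[]{r[0] | s[0], r[1], r[2]};
        if (r[0] == s[0] && r[2] == s[2]) return new long[]{r[0], r[1] | s[1], r[2]};
        if (r[0] == s[0] && r[1] == s[1]) return new long[]{r[0], r[1], r[2] | s[2]};
        return null;
    }

    // Explores every merge sequence and returns the smallest reachable
    // collection size. This is the brute force that the path-cutting idea
    // above would prune; without memoization it revisits many states.
    static int minSize(List<long[]> data) {
        int best = data.size();
        for (int i = 0; i < data.size(); ++i) {
            for (int j = i + 1; j < data.size(); ++j) {
                long[] m = tryMerge(data.get(i), data.get(j));
                if (m == null) continue;
                List<long[]> copy = new ArrayList<>(data);
                copy.set(i, m);
                copy.remove(j);
                best = Math.min(best, minSize(copy));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<long[]> data = new ArrayList<>(List.of(
                new long[]{3, 2, 3}, new long[]{0, 1, 0}, new long[]{1, 2, 0},
                new long[]{0, 1, 2}, new long[]{1, 2, 1}, new long[]{0, 2, 0}));
        System.out.println(minSize(data)); // 3: the greedy result is optimal here
    }
}
```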

Putting it all together

I was intrigued enough by this problem that I sat and coded this down in Python. Sadly, I was able to prove that different lookaheads yield possibly different results, and that even the blind algorithm occasionally gets it better than lookahead 1 or 2. This is a direct proof that the solution is not optimal (at least for lookahead ≪ |input|). See the source code and helper scripts, as well as proof-triplets on github. Be warned that, apart from memoization of merge results, I made no attempt at optimizing the code CPU-cycle-wise.

Answered by Irfy on Oct 05 '22.