Algorithm to efficiently select rows from a matrix such that column totals are equal

Question

The practical application of this problem is group assignment in a psychology study, but the theoretical formulation is this:

I have a matrix (the actual matrix is 27x72, but I'll pick a 4x8 as an example):

I want to pick half of the rows out of this matrix such that the column totals are equal (thus effectively creating two matrices with equivalent column totals). I cannot rearrange values within the rows.

I have tried some brute force solutions, but my matrix is too large for that to be effective, even having chosen some random restrictions first. It seems to me that the search space could be constrained with a better algorithm, but I haven't been able to think of one thus far. Any ideas? It is also possible that there is no solution, so an algorithm would have to be able to deal with that. I have been working in R, but I could switch to python easily.

Update Found a solution thanks to ljeabmreosn. Karmarkar-Karp worked great for an algorithm, and converting the rows to base 73 was inspired. I had a surprising hard time finding code that would actually give me the sub-sequences rather than just the final difference (maybe most people are only interested in this problem in the abstract?). Anyway this was the code:

First I converted my rows in to base 73 as the poster suggested. To do this I used the basein package in python, defining an alphabet with 73 characters and then using the basein.decode function to convert to decimel.

For the algorithm, I just added code to print the sub-sequence indices from this mailing list message from Tim Peters: https://mail.python.org/pipermail/tutor/2001-August/008098.html

from __future__ import nested_scopes

import sys
import bisect

class _Num:
    def __init__(self, value, index):
        self.value = value
        self.i = index

    def __lt__(self, other):
        return self.value < other.value

# This implements the Karmarkar-Karp heuristic for partitioning a set
# in two, i.e. into two disjoint subsets s.t. their sums are
# approximately equal.  It produces only one result, in O(N*log N)
# time.  A remarkable property is that it loves large sets:  in
# general, the more numbers you feed it, the better it does.

class Partition:
    def __init__(self, nums):
        self.nums = nums
        sorted = [_Num(nums[i], i) for i in range(len(nums))]
        sorted.sort()
        self.sorted = sorted

    def run(self):
        sorted = self.sorted[:]
        N = len(sorted)
        connections = [[] for i in range(N)]

        while len(sorted) > 1:
            bigger  = sorted.pop()
            smaller = sorted.pop()

            # Force these into different sets, by "drawing a
            # line" connecting them.
            i, j = bigger.i, smaller.i
            connections[i].append(j)
            connections[j].append(i)

            diff = bigger.value - smaller.value
            assert diff >= 0
            bisect.insort(sorted, _Num(diff, i))

        # Now sorted contains only 1 element x, and x.value is
        # the difference between the subsets' sums.

        # Theorem:  The connections matrix represents a spanning tree
        # on the set of index nodes, and any tree can be 2-colored.
        # 2-color this one (with "colors" 0 and 1).

        index2color = [None] * N

        def color(i, c):
            if index2color[i] is not None:
                assert index2color[i] == c
                return
            index2color[i] = c
            for j in connections[i]:
                color(j, 1-c)

        color(0, 0)

        # Partition the indices by their colors.
        subsets = [[], []]
        for i in range(N):
            subsets[index2color[i]].append(i)

        return subsets

if not sys.argv:
  print "error no arguments provided"
elif sys.argv[1]:
  f = open(sys.argv[1], "r")
  x = [int(line.strip()) for line in f]

N = 50
import math
p = Partition(x)
s, t = p.run()
sum1 = 0L
sum2 = 0L
for i in s:
    sum1 += x[i]
for i in t:
    sum2 += x[i]

print "Set 1:"
print s
print "Set 2:"
print t
print "Set 1 sum", repr(sum1)
print "Set 2 sum", repr(sum2)
print "difference", repr(abs(sum1 - sum2))

This gives the following output:

Set 1:
[0, 3, 5, 6, 9, 10, 12, 15, 17, 19, 21, 22, 24, 26, 28, 31, 32, 34, 36, 38, 41, 43, 45, 47, 48, 51, 53, 54, 56, 59, 61, 62, 65, 66, 68, 71]
Set 2:
[1, 2, 4, 7, 8, 11, 13, 14, 16, 18, 20, 23, 25, 27, 29, 30, 33, 35, 37, 39, 40, 42, 44, 46, 49, 50, 52, 55, 57, 58, 60, 63, 64, 67, 69, 70]
Set 1 sum 30309344369339288555041174435706422018348623853211009172L
Set 2 sum 30309344369339288555041174435706422018348623853211009172L
difference 0L

Which provides the indices of the proper subsets in a few seconds. Thanks everybody!

ljeabmreosn · Accepted Answer

Assuming each entry in the matrix can either be 0 or 1, this problem seems to be in the same family as the Partition Problem which only has a pseudo-polynomial time algorithm. Let r be the number of rows in the matrix and c be the number of columns in the matrix. Then, encode each row to a c-digit number of base r+1. This is to ensure when adding each encoding, there is no need to carry, thus equivalent numbers in this base will equate to two sets of rows whose column sums are equivalent. So in your example, you would convert each row into a 4-digit number of base 9. This would yield the numbers (converted into base 10):

1010₉ => 738₁₀

0101₉ => 82₁₀

1100₉ => 810₁₀

0110₉ => 90₁₀

0011₉ => 10₁₀

1010₉ => 738₁₀

1100₉ => 810₁₀

0101₉ => 82₁₀

Although you probably couldn't use the pseudo-polynomial time algorithm with this method, you could use a simple heuristic with some decision trees to try to speed up the bruteforce. Using the numbers above, you could try to use the Karmarkar-Karp heuristic. Implemented below is the first step of algorithm in Python 3:

# Sorted (descending) => 810, 810, 738, 738, 90, 82, 82, 10

from queue import PriorityQueue

def karmarkar_karp_partition(arr):
    pqueue = PriorityQueue()

    for e in arr:
        pqueue.put_nowait((-e, e))

    for _ in range(len(arr)-1):
        _, first = pqueue.get_nowait()
        _, second = pqueue.get_nowait()
        diff = first - second
        pqueue.put_nowait((-diff, diff))

    return pqueue.get_nowait()[1]

Here is the algorithm fully implemented. Note that this method is simply a heuristic and may fail to find the best partition.

Algorithm to efficiently select rows from a matrix such that column totals are equal

Tags:

algorithm

matrix

mrpeverill

1 Answers

ljeabmreosn

Recent Activity

Donate For Us

Algorithm to efficiently select rows from a matrix such that column totals are equal

Tags:

algorithm

matrix

mrpeverill

1 Answers

ljeabmreosn

Related questions

Recent Activity

Donate For Us