The practical application of this problem is group assignment in a psychology study, but the theoretical formulation is this:
I have a matrix (the actual matrix is 27x72, but I'll pick a 4x8 as an example):
1 0 1 0
0 1 0 1
1 1 0 0
0 1 1 0
0 0 1 1
1 0 1 0
1 1 0 0
0 1 0 1
I want to pick half of the rows out of this matrix such that the column totals are equal (thus effectively creating two matrices with equivalent column totals). I cannot rearrange values within the rows.
I have tried some brute force solutions, but my matrix is too large for that to be effective, even having chosen some random restrictions first. It seems to me that the search space could be constrained with a better algorithm, but I haven't been able to think of one thus far. Any ideas? It is also possible that there is no solution, so an algorithm would have to be able to deal with that. I have been working in R, but I could switch to python easily.
Update Found a solution thanks to ljeabmreosn. Karmarkar-Karp worked great for an algorithm, and converting the rows to base 73 was inspired. I had a surprising hard time finding code that would actually give me the sub-sequences rather than just the final difference (maybe most people are only interested in this problem in the abstract?). Anyway this was the code:
First I converted my rows in to base 73 as the poster suggested. To do this I used the basein package in python, defining an alphabet with 73 characters and then using the basein.decode function to convert to decimel.
For the algorithm, I just added code to print the sub-sequence indices from this mailing list message from Tim Peters: https://mail.python.org/pipermail/tutor/2001-August/008098.html
from __future__ import nested_scopes
import sys
import bisect
class _Num:
def __init__(self, value, index):
self.value = value
self.i = index
def __lt__(self, other):
return self.value < other.value
# This implements the Karmarkar-Karp heuristic for partitioning a set
# in two, i.e. into two disjoint subsets s.t. their sums are
# approximately equal. It produces only one result, in O(N*log N)
# time. A remarkable property is that it loves large sets: in
# general, the more numbers you feed it, the better it does.
class Partition:
def __init__(self, nums):
self.nums = nums
sorted = [_Num(nums[i], i) for i in range(len(nums))]
sorted.sort()
self.sorted = sorted
def run(self):
sorted = self.sorted[:]
N = len(sorted)
connections = [[] for i in range(N)]
while len(sorted) > 1:
bigger = sorted.pop()
smaller = sorted.pop()
# Force these into different sets, by "drawing a
# line" connecting them.
i, j = bigger.i, smaller.i
connections[i].append(j)
connections[j].append(i)
diff = bigger.value - smaller.value
assert diff >= 0
bisect.insort(sorted, _Num(diff, i))
# Now sorted contains only 1 element x, and x.value is
# the difference between the subsets' sums.
# Theorem: The connections matrix represents a spanning tree
# on the set of index nodes, and any tree can be 2-colored.
# 2-color this one (with "colors" 0 and 1).
index2color = [None] * N
def color(i, c):
if index2color[i] is not None:
assert index2color[i] == c
return
index2color[i] = c
for j in connections[i]:
color(j, 1-c)
color(0, 0)
# Partition the indices by their colors.
subsets = [[], []]
for i in range(N):
subsets[index2color[i]].append(i)
return subsets
if not sys.argv:
print "error no arguments provided"
elif sys.argv[1]:
f = open(sys.argv[1], "r")
x = [int(line.strip()) for line in f]
N = 50
import math
p = Partition(x)
s, t = p.run()
sum1 = 0L
sum2 = 0L
for i in s:
sum1 += x[i]
for i in t:
sum2 += x[i]
print "Set 1:"
print s
print "Set 2:"
print t
print "Set 1 sum", repr(sum1)
print "Set 2 sum", repr(sum2)
print "difference", repr(abs(sum1 - sum2))
This gives the following output:
Set 1:
[0, 3, 5, 6, 9, 10, 12, 15, 17, 19, 21, 22, 24, 26, 28, 31, 32, 34, 36, 38, 41, 43, 45, 47, 48, 51, 53, 54, 56, 59, 61, 62, 65, 66, 68, 71]
Set 2:
[1, 2, 4, 7, 8, 11, 13, 14, 16, 18, 20, 23, 25, 27, 29, 30, 33, 35, 37, 39, 40, 42, 44, 46, 49, 50, 52, 55, 57, 58, 60, 63, 64, 67, 69, 70]
Set 1 sum 30309344369339288555041174435706422018348623853211009172L
Set 2 sum 30309344369339288555041174435706422018348623853211009172L
difference 0L
Which provides the indices of the proper subsets in a few seconds. Thanks everybody!
Assuming each entry in the matrix can either be 0 or 1, this problem seems to be in the same family as the Partition Problem which only has a pseudo-polynomial time algorithm. Let r
be the number of rows in the matrix and c
be the number of columns in the matrix. Then, encode each row to a c
-digit number of base r+1
. This is to ensure when adding each encoding, there is no need to carry, thus equivalent numbers in this base will equate to two sets of rows whose column sums are equivalent. So in your example, you would convert each row into a 4-digit number of base 9. This would yield the numbers (converted into base 10):
10109 => 73810
01019 => 8210
11009 => 81010
01109 => 9010
00119 => 1010
10109 => 73810
11009 => 81010
01019 => 8210
Although you probably couldn't use the pseudo-polynomial time algorithm with this method, you could use a simple heuristic with some decision trees to try to speed up the bruteforce. Using the numbers above, you could try to use the Karmarkar-Karp heuristic. Implemented below is the first step of algorithm in Python 3:
# Sorted (descending) => 810, 810, 738, 738, 90, 82, 82, 10
from queue import PriorityQueue
def karmarkar_karp_partition(arr):
pqueue = PriorityQueue()
for e in arr:
pqueue.put_nowait((-e, e))
for _ in range(len(arr)-1):
_, first = pqueue.get_nowait()
_, second = pqueue.get_nowait()
diff = first - second
pqueue.put_nowait((-diff, diff))
return pqueue.get_nowait()[1]
Here is the algorithm fully implemented. Note that this method is simply a heuristic and may fail to find the best partition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With