I'm looking for an efficient way to split an array into chunks that have a similar histogram.
For example, given the array:
l = np.array([1, 2, 3, 4, 1, 1, 1, 2, 3, 4, 55, 5, 5, 5, 5, 5, 3, 2, 2, 21, 2])
I would like to obtain:
c1 = np.array([1, 4, 1, 5, 5, 3, 2])
c2 = np.array([2, 1, 3, 4, 5, 5, 2])
c3 = np.array([3, 1, 2, 55, 5, 2, 21])
Moreover, each chunk should have a similar size and a similar sum under a given function f:
1. |sum(ci, f) - sum(cj, f)| < e, for i != j
2. |len(ci) - len(cj)| < e, for i != j
where
sum(c, f) = f(c[0]) + ... + f(c[len(c) - 1])
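As a sketch, the two conditions can be checked for any candidate split like this (the names `balanced`, `e_cost`, and `e_len` are illustrative, and `f(i) = i**2` is just one example cost function):

```python
# Example cost function; any f can be substituted here.
f = lambda x: x ** 2

def balanced(chunks, e_cost, e_len):
    """Check conditions 1 and 2 above for a list of chunks:
    the spread of per-chunk costs stays below e_cost, and the
    spread of per-chunk lengths stays below e_len."""
    costs = [sum(f(x) for x in c) for c in chunks]   # sum(c, f) per chunk
    sizes = [len(c) for c in chunks]
    return (max(costs) - min(costs) < e_cost) and \
           (max(sizes) - min(sizes) < e_len)
```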
Edit:
To clarify the intention: I want to parallelize a process over a list into n sub-processes, with the cost distributed evenly among them. The cost of processing an element of the list is a function f of the integer at the same position in the array l, where f is the computational complexity of the process, for example f(i) = i^2. So I want all sub-processes to have the same computational cost, rather than having some finish too early while others run forever.
Let's start with a very weak assumption and define similar histograms in the following rudimentary way: for a set of integers S1, the histogram H(S1) is similar to the histogram H(S2) of a set S2 if sum(S1) = sum(S2).
Generally, you are after finding subsets S1, S2, ..., SN of an array A such that f(S1) = f(S2) = ... = f(SN), where under our assumption f = sum. Unfortunately, you've got yourself a k-Partition problem, which is NP-hard: if someone found an efficient (i.e. polynomial-time) way to do that, as you requested, the consequence would be that P = NP had been proven true first on Stack Overflow!
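In practice, though, a greedy heuristic often gets close enough: sort elements by cost, heaviest first, and always assign the next element to the chunk whose running total is currently smallest (the classic "longest processing time" rule for multiprocessor scheduling). A minimal sketch, where `split_balanced` and `n_chunks` are names I've made up and f(i) = i^2 is the example cost from the question:

```python
import numpy as np

def split_balanced(l, n_chunks, f=lambda x: x ** 2):
    """Greedy LPT heuristic: partition array l into n_chunks so that the
    per-chunk totals of f are roughly equal. Not optimal (the exact
    problem is NP-hard), but a decent approximation."""
    order = np.argsort([-f(x) for x in l])      # heaviest elements first
    chunks = [[] for _ in range(n_chunks)]      # index lists per chunk
    costs = [0.0] * n_chunks                    # running cost per chunk
    for idx in order:
        k = min(range(n_chunks), key=costs.__getitem__)  # lightest chunk
        chunks[k].append(idx)
        costs[k] += f(l[idx])
    return [l[np.array(c, dtype=int)] for c in chunks], costs

l = np.array([1, 2, 3, 4, 1, 1, 1, 2, 3, 4, 55, 5, 5, 5, 5, 5, 3, 2, 2, 21, 2])
chunks, costs = split_balanced(l, 3)
```

Note that a single dominant element (here 55, with f(55) = 3025) puts a floor under the imbalance no matter how you partition; the heuristic simply isolates it in its own chunk and balances the rest.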