
Minimize the sum of errors of representative integers

Given n integers in [0,10000], D1, D2, ..., Dn, where there may be duplicates and n can be huge:

I want to find k distinct representative integers (for example, k=5) in [0,10000], R1, R2, ..., Rk, such that the sum of errors of all the representative integers is minimized.

The error of a representative integer is defined below:

Assuming we have k representative integers in ascending order as {R1, R2, ..., Rk}, the error of R_i (for i < k) is:

    error(R_i) = \sum_{R_i \le D_j < R_{i+1}} (D_j - R_i)

and I want to minimize the sum of errors of the representative integers (R_k, the largest, contributes no error since it exceeds every D_j):

    \min_{R_1 < R_2 < \cdots < R_k} \sum_{i=1}^{k-1} error(R_i)

How can this be done efficiently?

EDIT1: The smallest of the k representative integers must be the smallest number in {D1, D2, ..., Dn}.

EDIT2: The largest of the k representative integers must be the largest number in {D1, D2, ..., Dn} plus 1. For example, when the largest number in {D1, D2, ..., Dn} is 9787, Rk is 9788.

EDIT3: Here is a concrete example:

D={1,3,3,7,8,14,14,14,30}, and if k=5 and R is picked as {1,6,10,17,31}, then the sum of errors is:

sum of errors=(1-1)+(3-1)*2+(7-6)+(8-6)+(14-10)*3+(30-17)=32

This is because 1 <= 1,3,3 < 6; 6 <= 7,8 < 10; 10 <= 14,14,14 < 17; 17 <= 30 < 31.
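
For concreteness, here is a minimal Python sketch of this cost function (the helper name sum_of_errors is mine, not from the question); it reproduces the 32 above by charging each data value its distance to the largest representative not exceeding it:

    import bisect

    def sum_of_errors(D, R):
        # R is assumed sorted ascending; each d is charged d - r, where r
        # is the largest representative with r <= d.
        total = 0
        for d in D:
            i = bisect.bisect_right(R, d) - 1  # index of largest R[i] <= d
            total += d - R[i]
        return total

    D = [1, 3, 3, 7, 8, 14, 14, 14, 30]
    R = [1, 6, 10, 17, 31]
    print(sum_of_errors(D, R))  # 32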

asked Mar 02 '11 by outlaw

3 Answers

Although with help from the community you have managed to state your problem in a mathematically understandable form, you still have not supplied enough information for me (or anyone else) to give a definitive answer. (I would have posted this as a comment, but for some reason the "add comment" option is not available to me.)

To give a good answer to this problem, we need to know the relative sizes of n and k versus each other and versus 10000 (or the expected maximum of the Di, if it isn't 10000), and whether it is critical that you attain the exact minimum (even if this requires an exorbitant amount of computation) or whether a close approximation would do (and if so, how close it needs to be). In addition, to know which algorithm runs in minimum time, we need to understand what kind of hardware will run it, i.e., whether we have m CPU cores to run on in parallel and how large m is relative to k.

Another important piece of information is whether this problem will be solved only once, or whether it must be solved many times with some connection between the distributions of the integers Di from one instance to the next (e.g., the Di are all random samples from a particular, unchanging probability distribution, or each successive instance takes as input the previous instance's set plus an extra s integers).

No reasonable algorithm for your problem should take time more than linear in n, since building a histogram of the n integers Di requires O(n) time, and the optimization itself depends only on that histogram, not on the ordering of the input. No algorithm can take less than O(n) time, since that is the size of the problem's input.
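
That O(n) histogram pass might look like this (a minimal sketch assuming the stated [0,10000] range):

    def histogram(D, max_value=10000):
        # One pass over the data; the optimization afterwards depends
        # only on these counts, not on the order of the D_i.
        counts = [0] * (max_value + 1)
        for d in D:
            counts[d] += 1
        return counts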

A brute-force search over all the possibilities requires (assuming that at least one of the Di is 0 and another is 10000), for small k, say k < 10, approximately O(10000^(k-2)) time, so if log10(n) >> 4(k-2), this is the optimal algorithm (since in this case the time for the brute-force search is insignificant compared to the time needed just to read the input). It is also interesting to note that if k is very close to 10000, then a brute-force search requires only O(10000^(10002-k)) time (because we can instead search over the integers which are not used as representatives).
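
A hedged sketch of that brute-force search, reusing the sum_of_errors helper sketched in the question above (the EDIT1/EDIT2 constraints fix the two endpoints, so only the k-2 interior representatives are enumerated):

    from itertools import combinations

    def brute_force(D, k):
        # Exhaustive search for small k: roughly O(range^(k-2)) candidates,
        # each scored in O(n log k) by sum_of_errors.
        lo, hi = min(D), max(D) + 1
        best = (float("inf"), None)
        for interior in combinations(range(lo + 1, hi), k - 2):
            R = [lo, *interior, hi]
            best = min(best, (sum_of_errors(D, R), R))
        return best

    print(brute_force([1, 3, 3, 7, 8, 14, 14, 14, 30], 5))
    # (5, [1, 7, 14, 30, 31]) for the question's example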

If you update the definition of the problem with more information, I will attempt to edit my answer in turn.

answered Nov 17 '22 by Ron Kaminsky


Now that the question is clarified, we observe that the R_i divide the Dx into k-1 intervals: [R_1,R_2), [R_2,R_3), ..., [R_{k-1},R_k). Every Dx belongs to exactly one of those intervals. Let q_i be the number of Dx in the interval [R_i, R_{i+1}), and let s_i be the sum of those Dx. Then each error(R_i) expression is a sum of q_i terms and evaluates to s_i - q_i·R_i.

Summing that over all i, we get a total error of S - sum(q_i·R_i), where S is the sum of all the Dx. So the problem is to choose the R_i to maximize sum(q_i·R_i). Remember each q_i is the number of original data values at least as large as R_i but smaller than the next representative.
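
A quick numeric check of this identity on the question's example (R = {1,6,10,17,31}, which gave a direct error of 32):

    import bisect

    D = sorted([1, 3, 3, 7, 8, 14, 14, 14, 30])
    R = [1, 6, 10, 17, 31]

    # q[i] = number of data values in the interval [R[i], R[i+1])
    q = [bisect.bisect_left(D, R[i + 1]) - bisect.bisect_left(D, R[i])
         for i in range(len(R) - 1)]

    S = sum(D)                                      # 94
    sum_qr = sum(qi * ri for qi, ri in zip(q, R))   # 3*1 + 2*6 + 3*10 + 1*17 = 62
    print(S - sum_qr)                               # 32, matching the direct sum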

Any global maximum must be a local maximum, so imagine increasing or decreasing one of the R_i. If R_i is not one of the original data values, then we can increase it without changing any of the q_i, which improves our target function. So an optimal solution has each R_i (except the fixed last one) equal to one of the data values. I got a bit bogged down in the math after that, but a sensible approach seems to be to pick the initial R_i as every (n/k)-th data value (simple percentiles), then iteratively check whether moving an R_i to the previous or next data value improves the score and thus decreases the error (a sketch of this appears after the example below). The q_i·R_i form seems easier to work with, since you can read the data, count repetitions, and update q_i and R_i while looking at only a single data/count point. You only need to store an array of 10,000 data counts, no matter how huge the data is.

data:   1  3  7  8 14 30
count:  1  2  1  1  3  1     sum(data) = 94

initial R: 1  3  8  14  31
initial Q: 1  3  1   4        sum(QR)  = 74 (hence error = 20)

In this example we could try changing the 3 or the 8 to a 7. If we increase the 3 to 7, there are two 3's in the original data, so the first two q's become 1+2 and 3-2; the contribution q_1·R_1 + q_2·R_2 goes from 1·1 + 3·3 = 10 to 3·1 + 1·7 = 10, so sum(QR) is unchanged. If instead we decrease the 8 to 7, sum(QR) rises from 74 to 77, cutting the error to 17. I'm sure there are smarter patterns for detecting which changes to the QR table are worthwhile, but this seems workable.
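
A heuristic Python sketch of this local search (percentile initialization, then single-step moves between adjacent distinct data values while sum(QR) strictly improves). All names are mine. Note that single-step moves can stall at a local optimum: on this data the search stops at error 17, while exhaustive search finds 5.

    import bisect

    def local_search(D, k):
        # Start R at evenly spaced percentiles of the sorted data, then
        # nudge each interior R[i] to an adjacent distinct data value
        # while that strictly increases sum(q_i * R_i).
        D = sorted(D)
        values = sorted(set(D))        # candidate positions for interior R[i]
        lo, hi = D[0], D[-1] + 1       # endpoints fixed by EDIT1/EDIT2

        def score(R):                  # sum(q_i * R_i)
            return sum((bisect.bisect_left(D, R[i + 1]) - bisect.bisect_left(D, R[i])) * R[i]
                       for i in range(len(R) - 1))

        # percentile initialization (assumes the picks are distinct)
        R = [lo] + [D[i * len(D) // (k - 1)] for i in range(1, k - 1)] + [hi]

        improved = True
        while improved:
            improved = False
            for i in range(1, k - 1):
                j = values.index(R[i])
                neighbors = values[max(j - 1, 0):j] + values[j + 1:j + 2]
                for cand in neighbors:
                    if cand not in R:
                        trial = sorted(R[:i] + [cand] + R[i + 1:])
                        if score(trial) > score(R):
                            R, improved = trial, True
        return R, sum(D) - score(R)

    print(local_search([1, 3, 3, 7, 8, 14, 14, 14, 30], 5))
    # ([1, 3, 7, 14, 31], 17): better than the initial error of 20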

answered Nov 17 '22 by Chris Nash


If the distribution is near random and the sample size n is large enough, you are generally wasting time trying to optimize further: the extra computation buys ever-smaller percentage improvements over the expected average. The fastest average-case solution is to set the lower k-1 representatives at the low ends of k-1 equal intervals of width M/(k-1), where M is the lowest upper bound minus the greatest lower bound (i.e., M = the maximum possible value - 0), and to set the last representative at M+1. Computing those values takes O(k) time, which is the best we can do with only the information given in this problem. Stating this is not a proof, of course.
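
A sketch of that O(k) choice (following this answer's assumption that the greatest lower bound is 0; the function name is mine):

    def evenly_spaced(max_value, k):
        # Lower k-1 representatives at the low ends of k-1 equal-width
        # intervals over [0, max_value]; the last one at max_value + 1.
        M = max_value  # lowest upper bound minus greatest lower bound (0)
        return [round(i * M / (k - 1)) for i in range(k - 1)] + [M + 1]

    print(evenly_spaced(10000, 5))  # [0, 2500, 5000, 7500, 10001]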

My point is this: the above is one simplification that I think is very practical for one large class of input sets. At the other extreme, it is straightforward to compute the error of every possible choice of representatives and select the smallest, but the running time makes that solution intractable in many cases. Since the asker expects more than the most direct and exact (intractable) answer, much remains open-ended: we could trim at the edges from here to eternity, trying to quantify properties across the infinite solution space of all possible combinations of n numbers and all k values.

answered Nov 17 '22 by Jose_X