I have a snippet of code that takes a large data set and turns it into a proportionally smaller data set. Let me explain:
Let's say you have 20 blue marbles and 10 red marbles. If I wanted to represent this data with 3 marbles, I would use 2 blue and 1 red.
I don't mind if it's not exact, for example when representing 17 blue and 16 red with 4 marbles. The closest proportional representation would be 2 blue and 2 red, and that's okay.
This is my code in Python:
from random import randrange
data_set = [randrange(100, 1000) for x in range(5)]
required_amount = 20
special_number = required_amount / sum(data_set)
proportional_data_set = [round(x * special_number) for x in data_set]
print(data_set)
print(required_amount)
print(proportional_data_set)
print(sum(proportional_data_set))
The problem is that I set the required sample size to 20, but sometimes the proportional data set sums to 21 or 19. I assume this is because each value is rounded independently, but does anyone know a better way to solve this problem?
A sample output working correctly would be :
[832, 325, 415, 385, 745]
20
[6, 2, 3, 3, 6]
20
A sample working incorrectly would be :
[414, 918, 860, 978, 438]
20
[2, 5, 5, 5, 2]
19
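To see where the missing marble goes in the failing sample, it can help to print the exact (unrounded) shares next to the rounded ones. This is only a small diagnostic sketch that reuses the numbers from the sample above; the names `scale`, `shares` and `rounded` are mine and not part of the original code.

```python
data_set = [414, 918, 860, 978, 438]
required_amount = 20

# Exact fractional share of each item before any rounding
scale = required_amount / sum(data_set)
shares = [x * scale for x in data_set]

# Each share is rounded on its own, so nothing forces the total to stay at 20
rounded = [round(s) for s in shares]

print([f"{s:.3f}" for s in shares])
print(rounded, "=", sum(rounded))
```

The exact shares add up to 20, but four of the five are rounded down and only one is rounded up, so the rounded total drops to 19.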
If anyone knows of any similar methods that do something like this that would also be great.
Here's one way to solve the problem. Calculate special_number as the number of units in the data_set per "marble". Then use divmod() to calculate both the proportional amount and the remainder for each item. Since divmod() rounds every quotient down, sum(proportional_data_set) will be less than required_amount in most cases.
Finally, use a loop that finds the highest remainder and increments the matching entry of proportional_data_set, repeating until sum(proportional_data_set) == required_amount.
from random import randrange
data_set = [randrange(100, 1000) for x in range(5)]
required_amount = 20
special_number = sum(data_set) // required_amount
print("Data set:")
print(data_set)
print("Special number:")
print(special_number)
# divmod() returns a pair of numbers, split them into quotients and remainders
pairs = [divmod(x, special_number) for x in data_set]
proportional_data_set = [x[0] for x in pairs]
remainder = [x[1] for x in pairs]
print()
print("Proportional data set before adjusting:")
print(proportional_data_set, "=", sum(proportional_data_set))
print("Remainders:")
print(remainder)
while sum(proportional_data_set) < required_amount:
    i = remainder.index(max(remainder))  # index of the highest remainder
    proportional_data_set[i] += 1  # add another marble to this index
    remainder[i] = -1  # don't use this remainder again
print()
print("Proportional data set after adjusting:")
print(proportional_data_set, "=", sum(proportional_data_set))
print("Remainders:")
print(remainder)
The output looks like:
Data set:
[546, 895, 257, 226, 975]
Special number:
144
Proportional data set before adjusting:
[3, 6, 1, 1, 6] = 17
Remainders:
[114, 31, 113, 82, 111]
Proportional data set after adjusting:
[4, 6, 2, 1, 7] = 20
Remainders:
[-1, 31, -1, 82, -1]
The highest remainders were used to increment the proportional data set and then set to -1.
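The same largest-remainder idea can also be wrapped in a reusable function. The sketch below is my own variation, not the code above: the name `largest_remainder` is hypothetical, and it multiplies each value by required_amount before dividing by the total instead of using a floored special_number, so all arithmetic stays in exact integers. It assumes non-negative integers with a positive sum.

```python
def largest_remainder(data_set, required_amount):
    """Allocate required_amount units proportionally to data_set.

    Assumes data_set holds non-negative integers with a positive sum.
    """
    total = sum(data_set)
    # Exact identity: x * required_amount == quotient * total + remainder
    pairs = [divmod(x * required_amount, total) for x in data_set]
    result = [q for q, _ in pairs]
    shortfall = required_amount - sum(result)
    # Indices ordered by remainder, largest first (ties keep input order)
    order = sorted(range(len(data_set)), key=lambda i: pairs[i][1], reverse=True)
    for i in order[:shortfall]:
        result[i] += 1  # one extra unit for each of the largest remainders
    return result


print(largest_remainder([20, 10], 3))   # [2, 1]
print(largest_remainder([17, 16], 4))   # [2, 2]
print(largest_remainder([832, 325, 415, 385, 745], 20))  # [6, 2, 3, 3, 6]
```

Because the quotients are computed from the exact shares, their sum can never exceed required_amount, and the shortfall is always smaller than the number of items. In close cases this variant may hand the extra units to different items than the special_number version, since the remainders are measured against a slightly different divisor.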
I was going to offer a solution based on a Bresenham line between the cumulative sum of the input data and the cumulative sum of the proportional output values, but (a) it turns out to give the wrong answer (see below), and (b) I believe @tzaman's pointer to "Allocate an array of integers proportionally compensating for rounding errors" provides a simpler solution than any correction I could make to the Bresenham method. The proportional() function below is by @Dr. Goulu:
def proportional(nseats,votes):
    """assign n seats proportionally to votes using Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints seats allocated to each party
    """
    quota=sum(votes)/(1.+nseats) #force float
    frac=[vote/quota for vote in votes]
    res=[int(f) for f in frac]
    n=nseats-sum(res) #number of seats remaining to allocate
    if n==0: return res #done
    if n<0: return [min(x,nseats) for x in res] # see siamii's comment
    #give the remaining seats to the n parties with the largest remainder
    remainders=[ai-bi for ai,bi in zip(frac,res)]
    limit=sorted(remainders,reverse=True)[n-1]
    #n parties with remainder larger than limit get an extra seat
    for i,r in enumerate(remainders):
        if r>=limit:
            res[i]+=1
            n-=1 # attempt to handle perfect equality
            if n==0: return res #done
    raise #should never happen
print (proportional(20,[832, 325, 415, 385, 745]))
print (proportional(20,[414, 918, 860, 978, 438]))
... gives the output:
[6, 2, 3, 3, 6]
[2, 5, 5, 6, 2]
... as required.
For those who may be interested in the Bresenham line (non-)solution, here it is, based on the code here:
import itertools, operator
def bresenhamLine(x0, y0, x1, y1):
    dx = abs(x1 - x0)
    dy = abs(y1 - y0)
    sx = x0 < x1 and 1 or -1
    sy = y0 < y1 and 1 or -1
    err = dx - dy
    points = []
    x, y = x0, y0
    while True:
        points += [(x, y)]
        if x == x1 and y == y1:
            break
        e2 = err * 2
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
    return points
def proportional(n,inp):
    cumsum = list(itertools.accumulate(inp))
    pts = bresenhamLine(0,0,max(cumsum),n)
    yval = [y for x,y in pts]
    cumsum2 = [yval[x] for x in cumsum]
    res = [cumsum2[0]]
    for i,x in enumerate(cumsum2[1:]):
        res.append(x-cumsum2[i])
    return res
print (proportional(20,[832, 325, 415, 385, 745]))
print (proportional(20,[414, 918, 860, 978, 438]))
... however the output is
[6, 3, 3, 2, 6]
[2, 5, 5, 6, 2]
... which is incorrect: for the second to fourth items in the first list (325, 415, 385), it assigns "2" to the middle-ranked item (385) rather than the lowest-ranked item (325). The Hagenbach-Bischoff quota method gets this allocation correct.
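One way to put a number on "incorrect" is to compare each allocation with the exact proportional shares. The sketch below uses the sum of absolute differences as the error measure; that metric and the helper name `total_error` are my own choices for illustration, not something either method defines.

```python
def total_error(data_set, allocation):
    """Sum of absolute differences between an allocation and the exact shares."""
    scale = sum(allocation) / sum(data_set)
    return sum(abs(x * scale - a) for x, a in zip(data_set, allocation))


data_set = [832, 325, 415, 385, 745]
quota_result = [6, 2, 3, 3, 6]      # Hagenbach-Bischoff output quoted above
bresenham_result = [6, 3, 3, 2, 6]  # Bresenham output quoted above

print(total_error(data_set, quota_result))
print(total_error(data_set, bresenham_result))
```

With this measure the quota-based allocation has the smaller error, because 385 has an exact share of about 2.85 and 325 only about 2.41, so giving the third marble to 325 moves both items further from their ideal values.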