I have a list of data points. For the full run of my program I'll use all of them, but for testing the code I want to use only a small percentage so that the program runs quickly. I don't simply want to take the first n elements of the list, though; I want to select an evenly distributed subset of the elements. So if I'm using 50% of the data points, I might want to select every second data point from the list.
Basically, I want a function that takes a list and a percentage as arguments and returns a list consisting of evenly distributed elements from the input list, whose number corresponds as closely as possible to the requested percentage.
What would be a good way to do this?
For completeness, consider the following.
The problem can be split into two parts:
1. Determine the number of elements to pick, given a certain percentage or fraction.
2. Choose which elements from the list should be picked.
The first point is straightforward. If you would like to have percentage = 35 % of your list, you'd ideally choose round(len(my_list) * (percentage / 100.)) elements. Note that you'd get the percentage exactly right only if len(my_list) * (percentage / 100.) is a whole number. This inaccuracy is inevitable, as a continuous measure (the percentage) is converted into a discrete one (the number of elements).
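As a quick sketch of this first step (the list length 42 and percentage 35 are chosen here only for illustration):

```python
my_list = list(range(42))  # an example list of 42 data points
percentage = 35
# ideal number of elements to pick: round(42 * 0.35) = round(14.7) = 15
n = round(len(my_list) * (percentage / 100.0))
print(n)  # 15
```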
The second point depends on the specific requirements you have for which elements should be returned. Choosing elements that are as equally distributed as possible is doable, but certainly not the easiest approach.
Here is how you would do this conceptually (see below for an implementation):
If you have a list of length l from which you'd like an equally distributed fraction f (f = percentage / 100.), you will have to bin the indexes of your list into round(l * f) bins of size l / round(l * f). What you want is then the most central element of each bin.
Why does this work?
For the first point, note that with bins of size l / round(l * f) we end up with l / (l / round(l * f)) = round(l * f) bins, which is the ideal number (see point 1 above). If for each of these equally sized bins we then choose its most central element, we get a list of elements that are as equally distributed as possible.
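As a numeric sketch of the binning idea (assuming l = 10 and f = 0.3, i.e. three bins of size 10/3 ≈ 3.33; the variable names here are illustrative, not from the implementation below):

```python
from bisect import bisect_left

l, f = 10, 0.3
nbr_bins = int(round(l * f))                 # 3 bins
step = l / nbr_bins                          # bin size: 3.33...
walls = [step * i for i in range(nbr_bins)]  # bin starts: [0.0, 3.33..., 6.66...]
# index boundaries of the bins: [0, 4, 7, 10]
splits = [bisect_left(range(l), w) for w in walls] + [l]
# most central index of each bin
picked = [(start + stop) // 2 for start, stop in zip(splits, splits[1:])]
print(picked)  # [2, 5, 8]
```

So for a list of ten elements and f = 0.3, the three most central indexes of the three bins are 2, 5 and 8.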
Here is a simple (neither speed-optimized nor particularly pretty) implementation of this:

from bisect import bisect_left

def equal_dist_els(my_list, fraction):
    """
    Choose a fraction of equally distributed elements.
    :param my_list: The list to draw from
    :param fraction: The ideal fraction of elements
    :return: Elements of the list with the best match
    """
    length = len(my_list)
    list_indexes = range(length)
    nbr_bins = int(round(length * fraction))
    step = length / nbr_bins  # the size of a single bin
    bins = [step * i for i in range(nbr_bins)]  # the start of each bin
    # distribute the indexes into the bins
    splits = [bisect_left(list_indexes, wall) for wall in bins]
    splits.append(length)  # add the end of the last bin
    # get a list of (start, stop) indexes for each bin
    bin_limits = [(splits[i], splits[i + 1]) for i in range(len(splits) - 1)]
    out = []
    for start, stop in bin_limits:
        in_bin = my_list[start:stop]  # the elements of my_list in this bin
        out.append(in_bin[len(in_bin) // 2])  # choose the most central element
    return out
We can now compare this ideal algorithm (equal_dist_els) with the slicing approach by @jonrsharpe:
See further below for the code.
Along the x-axis is the requested fraction of elements and on the y-axis the absolute difference between the requested fraction and the fraction actually returned by the two methods. We see that for fractions just above 2/3 (~67%) the deviation of the slicing method is considerable: if you ask for ~70%, the slicing method returns all the elements (100%), a deviation of about 30 percentage points.
To conclude, the slicing method by @jonrsharpe works well for small fractions (roughly below 0.1) but becomes increasingly inaccurate for larger fractions. Also note that its inaccuracy is independent of the length of the list. The binning algorithm is certainly somewhat more complicated to implement and most probably also slower. However, its inaccuracy is just the inevitable inaccuracy mentioned above, which decreases with increasing list length.
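To see where the slicing deviation comes from, a small self-contained check (select_els is re-defined here as in the plotting code below):

```python
def select_els(seq, perc):
    """Select a defined fraction of the elements of seq (slicing approach)."""
    return seq[::int(round(1.0 / perc))]

my_list = list(range(50))
frac = 0.7
picked = select_els(my_list, frac)
# the step is int(round(1 / 0.7)) = int(round(1.43)) = 1, so every element is returned
print(len(picked) / len(my_list))  # 1.0, a deviation of 0.3 from the requested 0.7
```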
Code for the plots:

from matplotlib import pyplot as plt
# for the definition of equal_dist_els, see above

def select_els(seq, perc):
    """Select a defined fraction of the elements of seq."""
    return seq[::int(round(1.0 / perc))] if perc != 0 else []

list_length = 50
my_list = list(range(list_length))
percentages = range(1, 101)
fracts = [p * 0.01 for p in percentages]
equal_dist = [abs(len(equal_dist_els(my_list, f)) / list_length - f) for f in fracts]
slicing = [abs(len(select_els(my_list, f)) / list_length - f) for f in fracts]
plt.plot(fracts, equal_dist, color='blue', alpha=0.8, linewidth=2, label=r'equal_dist_elements')
plt.plot(fracts, slicing, color='red', alpha=0.8, linewidth=2, label=r'select_elements by @jonrsharpe')
plt.title('Choosing equally dist. fraction of els from a list of length %s' % str(list_length))
plt.xlabel('requested fraction')
plt.ylabel('absolute deviation')
plt.legend(loc='upper left')
plt.show()
This can trivially be achieved by slicing with a step:
def select_elements(seq, perc):
    """Select a defined percentage of the elements of seq."""
    return seq[::int(100.0 / perc)]
In use:
>>> select_elements(list(range(10)), 50)
[0, 2, 4, 6, 8]
>>> select_elements(list(range(10)), 33)
[0, 3, 6, 9]
>>> select_elements(list(range(10)), 25)
[0, 4, 8]
You could also add round, as int will truncate:
>>> int(3.6)
3
>>> int(round(3.6))
4
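For instance, a sketch of the rounded variant (the name select_elements_rounded is hypothetical, chosen to keep it apart from the original function):

```python
def select_elements_rounded(seq, perc):
    """Slicing with a rounded step instead of a truncated one."""
    return seq[::int(round(100.0 / perc))]

data = list(range(10))
# 100 / 28 = 3.57...: truncation gives step 3, rounding gives step 4
print(data[::int(100.0 / 28)])            # [0, 3, 6, 9]
print(select_elements_rounded(data, 28))  # [0, 4, 8]
```

Asking for 28% of ten elements ideally yields three, so the rounded step is the closer match here.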
If you want to use a proportion rather than a percentage (e.g. 0.5 instead of 50), simply replace 100.0 with 1.
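A sketch of that proportion variant (the name select_elements_prop is hypothetical), assuming the same truncating behaviour as the original:

```python
def select_elements_prop(seq, prop):
    """Like select_elements, but takes a proportion (0.5) instead of a percentage (50)."""
    return seq[::int(1.0 / prop)]

print(select_elements_prop(list(range(10)), 0.5))   # [0, 2, 4, 6, 8]
print(select_elements_prop(list(range(10)), 0.25))  # [0, 4, 8]
```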