If you don't care about the details of what I'm trying to implement, just skip ahead to the problem statement further down.

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x, and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions in error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:
resample = np.random.choice(x, size=len(x), replace=True)
This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I infer the error on f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
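Putting the pieces together, the whole procedure is roughly the following (np.median here is just an illustrative stand-in for f, not my actual statistic):

import numpy as np

def bootstrap_error(x, f=np.median, num_samples=1000):
    # Resample x with replacement num_samples times, evaluate f on each
    # resample, and report the spread of those values as the error estimate.
    results = [f(np.random.choice(x, size=len(x), replace=True))
               for _ in range(num_samples)]
    return np.std(results)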
Okay, so that's a bootstrap. Now, my problem is that the line
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension
results = [f(arr) for arr in resamples]
can be pretty slow too, depending on the details of the function f(x).
Since we are allowing repetitions, we could generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]
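If f is a NumPy reduction that accepts an axis argument (e.g. np.mean - an assumption, since the question doesn't say what f is), the second list comprehension can be vectorized over the rows of resamples_arr as well; for an arbitrary scalar-valued f, np.apply_along_axis at least keeps it compact, though it still loops internally -

# assuming f supports an axis argument, e.g. f = np.mean (not stated in the question)
results = f(resamples_arr, axis=1)   # one statistic per resample
error = np.std(results)

# fallback for a generic scalar-valued f (still a Python-level loop inside)
results = np.apply_along_axis(f, 1, resamples_arr)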
One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them up to the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]
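On NumPy 1.17+, the same index generation can also be written with the newer Generator API; I haven't timed it here, so treat it as an untimed alternative rather than a third benchmark -

rng = np.random.default_rng()
idx = rng.integers(0, len(x), size=(num_samples, len(x)))   # same shape of indices as above
resamples_arr = x[idx]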
Runtime test with x of 5000 elems -
In [221]: x = np.random.randint(0,10000,(5000))
# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop
# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop
# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop
For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, a per-sample solution would have timings something like this (a memory-light loop along these lines is sketched after the timings) -
In [234]: x = np.random.randint(0,10000,(600000))
# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop
# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop
# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
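For example, the resampling and the evaluation of f can be fused into a single loop so that only one resample is held in memory at a time (a sketch, with f and num_samples as defined in the question):

results = np.empty(num_samples)
for i in range(num_samples):
    # draw one set of indices, evaluate f on that resample, then discard it
    idx = np.random.randint(0, len(x), len(x))
    results[i] = f(x[idx])
error = np.std(results)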