Efficient way to sample a large array many times with NumPy?

If you don't care about the details of what I'm trying to implement, just skip past the lower horizontal line

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions of error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:

resample = np.random.choice(x, size=len(x), replace=True)

This gives me a new realization of x. This operation must be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I have inferred the error of f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
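Putting those pieces together, a minimal sketch of the whole procedure looks like this (np.median is only a stand-in for f here; substitute the actual statistic):

import numpy as np

def bootstrap_error(x, f, num_samples=1000):
    # draw num_samples resamples of x, with replacement, each the size of x
    resamples = [np.random.choice(x, size=len(x), replace=True)
                 for _ in range(num_samples)]
    # apply the statistic to each realization and take the spread as the error
    results = [f(arr) for arr in resamples]
    return np.std(results)

x = np.random.randn(5000)               # stand-in data
err = bootstrap_error(x, np.median)     # np.median as a placeholder statistic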


Okay, so that's a bootstrap. Now, my problem is that the line

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension

results = [f(arr) for arr in resamples]

can be pretty slow too, depending on the details of the function f(x).

asked Oct 24 '17 by pretzlstyle




1 Answer

Since repetitions are allowed, we can generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of resamples, like so -

num_samples = 1000
idx = np.random.randint(0, len(x), size=(num_samples, len(x)))  # all resample indices at once
resamples_arr = x[idx]                                          # each row is one resample
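As a side benefit, once the resamples sit in a single 2D array, the second list comprehension can often be vectorized as well, provided f can operate along an axis (most NumPy reductions can). A sketch with np.median standing in for f:

results = np.median(resamples_arr, axis=1)   # one statistic per row/resample
error = np.std(results)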

One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them by the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]
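On NumPy 1.17+ the newer Generator API can do the same index trick; this is only a sketch of the equivalent call, not something benchmarked here:

rng = np.random.default_rng()
idx = rng.integers(0, len(x), size=(num_samples, len(x)))   # indices with replacement
resamples_arr = x[idx]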

Runtime test with x of 5000 elements -

In [221]: x = np.random.randint(0,10000,(5000))

# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop

# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop

# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop

For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1,000 samples at once. In that case, the per-sample versions of each solution have timings something like this -

In [234]: x = np.random.randint(0,10000,(600000))

# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop

# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop

# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
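In that regime, a memory-light sketch draws one index array per iteration and reduces it immediately, so only a single resample is held in memory at a time (again with np.median as a placeholder for f):

results = np.empty(num_samples)
for i in range(num_samples):
    idx = np.random.randint(0, len(x), len(x))   # indices for one resample
    results[i] = np.median(x[idx])               # apply the statistic right away
error = np.std(results)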
answered Oct 08 '22 by Divakar