If you don't care about the details of what I'm trying to implement, just skip ahead to the problem statement further down.

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x, and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions in error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:
resample = np.random.choice(x, size=len(x), replace=True)
This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I infer the error on f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
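Putting the pieces together, the whole procedure is roughly the following (np.median here is just an illustrative stand-in for f, not my actual statistic):

import numpy as np

def bootstrap_error(x, f=np.median, num_samples=1000):
    # Resample x with replacement num_samples times, evaluate f on each
    # resample, and report the spread of those values as the error estimate.
    results = [f(np.random.choice(x, size=len(x), replace=True))
               for _ in range(num_samples)]
    return np.std(results)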
Okay, so that's a bootstrap. Now, my problem is that the line
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension
results = [f(arr) for arr in resamples]
can be pretty slow too, depending on the details of the function f(x).
Since we are allowing repetitions, we could generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]
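If f is a NumPy reduction that accepts an axis argument (e.g. np.mean - an assumption, since the question doesn't say what f is), the second list comprehension can be vectorized over the rows of resamples_arr as well; for an arbitrary scalar-valued f, np.apply_along_axis at least keeps it compact, though it still loops internally -

# assuming f supports an axis argument, e.g. f = np.mean (not stated in the question)
results = f(resamples_arr, axis=1)   # one statistic per resample
error = np.std(results)

# fallback for a generic scalar-valued f (still a Python-level loop inside)
results = np.apply_along_axis(f, 1, resamples_arr)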
One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them up to the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]
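On NumPy 1.17+, the same index generation can also be written with the newer Generator API; I haven't timed it here, so treat it as an untimed alternative rather than a third benchmark -

rng = np.random.default_rng()
idx = rng.integers(0, len(x), size=(num_samples, len(x)))   # same shape of indices as above
resamples_arr = x[idx]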
Runtime test with x of 5000 elems -
In [221]: x = np.random.randint(0,10000,(5000))
# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop
# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop
# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop
For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, a per-sample solution would have timings something like this (a memory-light loop along these lines is sketched after the timings) -
In [234]: x = np.random.randint(0,10000,(600000))
# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop
# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop
# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
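For example, the resampling and the evaluation of f can be fused into a single loop so that only one resample is held in memory at a time (a sketch, with f and num_samples as defined in the question):

results = np.empty(num_samples)
for i in range(num_samples):
    # draw one set of indices, evaluate f on that resample, then discard it
    idx = np.random.randint(0, len(x), len(x))
    results[i] = f(x[idx])
error = np.std(results)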