Drawing a bootstrap sample from a pandas.DataFrame

Tags: pandas, numpy

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the built-in iloc indexer with an array of random integers seems to be slow:

import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop

Indexing the numpy array is of course much faster:

%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 µs per loop

But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster:

%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 µs per loop
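The extract, sample and rebuild approach above can be wrapped in a small helper. This is only a sketch: the function name `bootstrap_via_values` and the use of `np.random.default_rng` are assumptions of this example, not part of the original post. Unlike the one-liner, it also carries the original row labels over to the result. Going through a plain array assumes all columns share one dtype; with mixed dtypes the array becomes `object` and the column dtypes are lost.

```python
import numpy as np
import pandas


def bootstrap_via_values(df, rng=None):
    """Resample the rows of df with replacement by indexing the underlying array."""
    rng = np.random.default_rng() if rng is None else rng
    # One random row position per output row, drawn with replacement.
    idx = rng.integers(len(df), size=len(df))
    # Index the raw array, then rebuild the frame. Passing index=... keeps the
    # original row labels, which the one-liner in the post discards.
    return pandas.DataFrame(df.to_numpy()[idx], columns=df.columns, index=df.index[idx])


# Small illustrative frame (hypothetical data, not the 5000-row benchmark frame).
df = pandas.DataFrame(np.arange(20.0).reshape(10, 2), columns=["a", "b"])
sample = bootstrap_via_values(df, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the draw repeatable, which helps when comparing timings between approaches.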

@JohnE suggested sample, which is unfortunately even slower:

%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop
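For reference, a minimal sketch of what the `sample` call does when used as a bootstrap. The small frame and the `random_state` value are assumptions chosen for illustration; `random_state` is only there to make the draw repeatable and is not used in the timing above.

```python
import numpy as np
import pandas

# Small illustrative frame (hypothetical data).
rng = np.random.default_rng(0)
df = pandas.DataFrame(rng.uniform(size=(8, 2)), columns=["a", "b"])

# Draw as many rows as the frame has, with replacement.
# The same random_state gives the same bootstrap sample twice.
boot_a = df.sample(n=len(df), replace=True, random_state=42)
boot_b = df.sample(n=len(df), replace=True, random_state=42)
```

The result keeps the original row labels, so repeated rows show up as repeated index values.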

@firelynx suggested merge:

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop

Does anyone have an idea why iloc is so slow and/or whether there are better alternatives than extracting the values, sampling and then constructing a new pandas.DataFrame?

asked Jul 19 '15 15:07

Till Hoffmann

2 Answers

The merge method in pandas is fairly well optimized, so I tried my luck with it and got a speed increase: about 20% when the random index is built outside the timed statement, and about 5% when it is built inline. My machine is a bit slower than yours and I'm using pandas 0.15.2, so your numbers may differ.

%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop

randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop
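To make it clearer why a right merge on the index acts as a resampling step, here is a small sketch with a fixed set of draws in place of `np.random.randint`. The frame contents and the draws `[2, 0, 2]` are hypothetical values chosen so the result is easy to check by eye.

```python
import numpy as np
import pandas

df = pandas.DataFrame({"a": [10.0, 11.0, 12.0, 13.0]})

# An empty frame whose index holds the drawn row labels (label 2 drawn twice).
draws = pandas.DataFrame(index=np.array([2, 0, 2]))

# how='right' keeps one output row per entry of the right-hand index,
# so a label that was drawn twice appears twice in the result.
out = df.merge(draws, left_index=True, right_index=True, how="right")
```

Because the join is on index labels rather than positions, this only matches the iloc version when the frame has the default 0..n-1 index.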

answered Oct 02 '22 05:10

firelynx

Indexing Speeds

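Boolean indexing tested slightly faster for me. Note that a random boolean mask keeps each row at most once (about half of the rows on average), so it produces a subsample without replacement rather than a bootstrap sample of size n: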

Boolean Indexing

%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 µs per loop
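A short sketch of how the boolean-mask result differs from a sample drawn with replacement. The frame size and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas

n = 1000
rng = np.random.default_rng(0)
df = pandas.DataFrame(rng.uniform(size=(n, 2)), columns=["a", "b"])

# Boolean mask: each row is either kept once or dropped.
masked = df[rng.integers(2, size=n).astype(bool)]

# Integer positions drawn with replacement: always n rows, some repeated.
boot = df.iloc[rng.integers(n, size=n)]
```

Here `masked` has unique row labels and fewer than n rows, while `boot` has exactly n rows with repeated labels, so the two timings do not measure the same operation.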

numpy sampling and rebuilding the DataFrame

%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 µs per loop

answered Oct 02 '22 03:10

tmthydvnprt
