Drawing a bootstrap sample from a pandas.DataFrame

Tags: pandas, numpy

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the built-in iloc indexer with an array of random integers seems to be slow:

import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop

Indexing the numpy array is of course much faster:

%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 µs per loop

But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster:

%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 µs per loop
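A minimal sketch of the extract, sample, and rebuild approach wrapped in a reusable function. The name bootstrap_via_values and the rng parameter are hypothetical, and np.random.default_rng is a newer NumPy API than the np.random.randint calls timed above. The sketch assumes all columns share one numeric dtype; with mixed dtypes the intermediate array would fall back to object and the per-column types would be lost.

```python
import numpy as np
import pandas as pd


def bootstrap_via_values(df, rng=None):
    """Draw len(df) rows with replacement by indexing the underlying array.

    Hypothetical helper: assumes homogeneous numeric columns.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Row positions sampled uniformly with replacement
    idx = rng.integers(len(df), size=len(df))
    # Index the raw array, then rebuild the frame, carrying over the sampled
    # index labels so each row can still be traced back to its source row
    return pd.DataFrame(df.to_numpy()[idx], columns=df.columns, index=df.index[idx])


df = pd.DataFrame(np.random.uniform(size=(5000, 5)), columns=['a', 'b', 'c', 'd', 'e'])
sample = bootstrap_via_values(df, rng=np.random.default_rng(0))
print(sample.shape)
```

Unlike the one-liner above, this keeps the original index labels, which costs one extra index lookup but makes the result comparable to what iloc returns.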

@JohnE suggested sample, which is unfortunately even slower:

%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop

@firelynx suggested merge:

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop
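The timings above rely on IPython's %timeit magic. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The benchmark function and the candidates dictionary are illustrative names, not part of pandas, and the absolute numbers will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 5000
columns = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(np.random.uniform(size=(n, 5)), columns=columns)

# Each candidate draws n rows with replacement
candidates = {
    'iloc': lambda: df.iloc[np.random.randint(n, size=n)],
    'values': lambda: pd.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns),
    'sample': lambda: df.sample(n, replace=True),
}


def benchmark(funcs, repeat=3, number=20):
    """Return the best per-call time in seconds for each callable."""
    results = {}
    for name, func in funcs.items():
        # timeit.repeat returns total seconds for `number` calls, per repeat
        best = min(timeit.repeat(func, repeat=repeat, number=number))
        results[name] = best / number
    return results


timings = benchmark(candidates)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: {seconds * 1e6:8.1f} µs per loop")
```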

Does anyone have an idea why iloc is so slow and/or whether there are better alternatives than extracting the values, sampling and then constructing a new pandas.DataFrame?

asked Jul 19 '15 15:07 by Till Hoffmann




2 Answers

The merge method in pandas is fairly optimized, so I tried my luck with it, and it gave me a significant speed increase. My machine is a bit slower than yours and I'm using pandas 0.15.2, so things may be a bit different.

%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop

randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop

%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop

answered Oct 02 '22 05:10 by firelynx


Indexing Speeds

Boolean indexing tested slightly faster for me:

Boolean Indexing

%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 µs per loop
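One caveat worth checking before adopting the boolean mask: it can keep each row at most once, so it selects a random subset (about half the rows here) rather than n rows drawn with replacement. The short sketch below, with illustrative variable names, compares the two results.

```python
import numpy as np
import pandas as pd

n = 5000
df = pd.DataFrame(np.random.uniform(size=(n, 5)), columns=list('abcde'))

# Boolean mask: every row is either kept once or dropped
mask = np.random.randint(2, size=n).astype(bool)
subset = df[mask]

# Bootstrap draw: n row positions sampled with replacement
boot = df.iloc[np.random.randint(n, size=n)]

print(len(subset), subset.index.is_unique)  # fewer than n rows, no repeats
print(len(boot), boot.index.is_unique)      # exactly n rows, repeats expected
```

So the two timings measure different operations, and the mask version is not a drop-in replacement when a true bootstrap sample is needed.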

NumPy sampling and rebuilding the DataFrame

%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 µs per loop

answered Oct 02 '22 03:10 by tmthydvnprt