Why does random sampling scale with the dataset not the sample size? (pandas .sample() example)

When sampling randomly from distributions of varying sizes, I was surprised to observe that execution time seems to scale mostly with the size of the dataset being sampled from, not with the number of values being sampled. Example:

import pandas as pd
import numpy as np
import time as tm

#generate a small and a large dataset
testSeriesSmall = pd.Series(np.random.randn(10000))
testSeriesLarge = pd.Series(np.random.randn(10000000))

sampleSize = 10
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

sampleSize = 1000
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

The output is:

sample 10 from 10000 values: 0.00126 s
sample 10 from 10000000 values: 1.10504 s
sample 1000 from 10000 values: 0.00122 s
sample 1000 from 10000000 values: 1.15000 s

This seems counterintuitive. Maybe I'm dense, but the problem seems similar to generating a list of random indices, so I would have expected the number of values sampled to matter and the size of the dataset not to matter much. I've tried another implementation or two with similar results, though, so it's beginning to feel like I'm missing something fundamental.

My questions are twofold: (1) Is this a fundamental issue or a quirk of implementation in pandas? (2) Is there a significantly faster approach that can be taken to randomly sample from large datasets in this way?

asked Mar 25 '17 by c_layton




2 Answers

pandas.Series.sample() in your case boils down to this:

# simplified from the pandas internals: axis_length is len(self)
rs = np.random.RandomState()
# pick n distinct positions from range(axis_length)
locs = rs.choice(axis_length, size=n, replace=False)
# pull out the rows at those positions
return self.take(locs)

The slow part is rs.choice():

%timeit rs.choice(100000000, size=1, replace=False)
1 loop, best of 3: 9.43 s per loop

It takes about 10 seconds to generate a single random number! Divide the first argument by 10 and it takes about 1 second, so the cost tracks the size of the population, not the size of the sample. That's slow!

If you use replace=True it's super fast. So that's one workaround for you, if you don't mind having duplicate entries in your results.
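
As a minimal sketch of that workaround (assuming the testSeriesLarge Series and timing style from the question):

import numpy as np
import pandas as pd
import time as tm

testSeriesLarge = pd.Series(np.random.randn(10000000))

tStart = tm.time()
# replace=True skips the expensive no-replacement path, at the cost of
# possibly repeating rows in the result
currSample = testSeriesLarge.sample(n=1000, replace=True).values
print('with replacement: %.5f s' % (tm.time() - tStart))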

The NumPy documentation for choice(replace=False) says:

This is equivalent to np.random.permutation(np.arange(5))[:3]

That pretty much explains the problem: it generates a huge array of possible values, shuffles them, then takes the first N. This is the root cause of your performance problem, and it has already been raised in NumPy here: https://github.com/numpy/numpy/pull/5158

It is apparently difficult to fix in NumPy, because people rely on the result of choice() not changing between NumPy versions when the same random seed is used.
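
As a side note, newer NumPy versions (1.17 and later) ship a separate Generator API, and its choice() is, as far as I can tell, much faster for replace=False because it avoids building a full permutation; that API was free to change algorithms precisely because the legacy RandomState stream stays frozen. A rough sketch, reusing the question's testSeriesLarge:

import numpy as np
import pandas as pd

testSeriesLarge = pd.Series(np.random.randn(10000000))

# default_rng() requires NumPy >= 1.17
rng = np.random.default_rng()
# draw 1000 distinct row positions, then pull those rows out of the Series
locs = rng.choice(len(testSeriesLarge), size=1000, replace=False)
currSample = testSeriesLarge.take(locs).values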

Since your use case is quite narrow, you can do something like this:

def sample(series, n):
    # draw 2*n random positions; duplicates are possible but rare when n << len(series)
    locs = np.random.randint(0, len(series), n * 2)
    # np.unique() sorts its output, so shuffle before truncating to n positions,
    # otherwise the sample would be biased toward low indices
    locs = np.random.permutation(np.unique(locs))[:n]
    assert len(locs) == n, "sample() assumes n << len(series)"
    return series.take(locs)
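
For reference, a quick usage sketch of the helper above, assuming the testSeriesLarge Series and timing style from the question:

import numpy as np
import pandas as pd
import time as tm

testSeriesLarge = pd.Series(np.random.randn(10000000))

tStart = tm.time()
currSample = sample(testSeriesLarge, 1000).values  # sample() as defined above
print('sample 1000 from %d values: %.5f s' % (len(testSeriesLarge), tm.time() - tStart))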

That gives much faster times:

sample 10 from 10000 values: 0.00735 s
sample 10 from 1000000 values: 0.00944 s
sample 10 from 100000000 values: 1.44148 s
sample 1000 from 10000 values: 0.00319 s
sample 1000 from 1000000 values: 0.00802 s
sample 1000 from 100000000 values: 0.01989 s
sample 100000 from 1000000 values: 0.05178 s
sample 100000 from 100000000 values: 0.93336 s
answered by John Zwinck


This looks to be an internal numpy issue. I believe the pandas sample method calls numpy.random.choice. Let's take a look at how numpy performs with various array sizes and sample sizes.

Create arrays

large = np.arange(1000000)
small = np.arange(1000)

Time the sample without replacement

%timeit np.random.choice(large, 10, replace=False)
10 loops, best of 3: 27.4 ms per loop

%timeit np.random.choice(small, 10, replace=False)
10000 loops, best of 3: 41.4 µs per loop

Time the sample with replacement

%timeit np.random.choice(large, 10, replace=True)
100000 loops, best of 3: 11.7 µs per loop

%timeit np.random.choice(small, 10, replace=True)
100000 loops, best of 3: 12.2 µs per loop

Very interestingly, when sampling without replacement, the large array takes nearly three orders of magnitude longer, and it is exactly three orders of magnitude larger. This suggests to me that numpy is randomly shuffling the whole array and then taking the first 10 items.
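
One way to sanity-check that hypothesis is to time the permutation-then-slice equivalence quoted in the other answer directly; a rough sketch reusing the arrays defined above (if the shuffle dominates, these per-call times should roughly match the replace=False timings):

import numpy as np
from timeit import timeit

large = np.arange(1000000)
small = np.arange(1000)

# the documented replace=False equivalence: shuffle the whole population,
# then keep the first k entries
print(timeit(lambda: np.random.permutation(large)[:10], number=10) / 10)
print(timeit(lambda: np.random.permutation(small)[:10], number=1000) / 1000)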

When sampling with replacement, each value is chosen independently, so the cost depends only on the sample size and the timings are essentially identical regardless of array size.

answered by Ted Petrou