Pandas create random samples without duplicates

Question

I have a pandas DataFrame containing ~200,000 rows, and I would like to create 5 random samples of 1000 rows each. However, I do not want any of these samples to contain the same row twice.

To create a random sample I have been using:

import numpy as np
# np.random.choice draws with replacement by default (replace=True)
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.loc[rows]

However, simply repeating this several times runs the risk of producing duplicates. Would the best way to handle this be to keep track of which rows have already been sampled?

Admin · Accepted Answer

You can use df.sample, which samples without replacement by default.

A dataframe with 100 rows and 5 columns:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))

Sample 5 rows:

df.sample(5)
Out[8]: 
           a         b         c         d         e
84  0.012201 -0.053014 -0.952495  0.680935  0.006724
45 -1.347292  1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169  0.999899  0.524546 -1.289632 -0.370625
64  1.542704 -0.971672 -1.150900  0.554445 -1.328722
99  0.012143 -2.450915 -0.718519 -1.192069 -1.268863
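As a quick check of that behaviour, here is a minimal sketch (the names sample, no_repeats, rows and choice_unique are illustrative, and random_state is only there to make the draw repeatable). It also shows that np.random.choice, as used in the question, needs replace=False to give the same guarantee:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))

# df.sample defaults to replace=False, so no row is drawn twice.
sample = df.sample(5, random_state=0)
no_repeats = sample.index.is_unique

# np.random.choice defaults to replace=True; replace=False rules out repeats.
rows = np.random.choice(df.index.values, 5, replace=False)
choice_unique = len(set(rows)) == len(rows)
```

Both flags should come out True here, since the index of df is unique to begin with.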

Because df.sample draws without replacement by default, those 5 rows are guaranteed to be different. If you want to repeat this process, I'd suggest sampling number_of_rows * number_of_samples rows in a single call (this total cannot exceed the number of rows in the dataframe). For example, if each sample is going to contain 5 rows and you need 10 samples, sample 50 rows. The first 5 rows form the first sample, the next 5 form the second, and so on:

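# one draw of 50 distinct rows, in random order
all_samples = df.sample(50)
# cut the draw into 10 consecutive chunks of 5 rows each
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]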
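Applied to the numbers in the question (5 samples of 1000 rows from ~200,000), the same idea could be wrapped in a small helper. This is only a sketch: disjoint_samples, its parameters, and the randomly generated stand-in dataframe are assumptions for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

def disjoint_samples(df, n_samples, sample_size, random_state=None):
    """Return n_samples DataFrames of sample_size rows each, sharing no rows.

    Hypothetical helper, not a pandas function.
    """
    total = n_samples * sample_size
    if total > len(df):
        raise ValueError("not enough rows to draw disjoint samples")
    # One draw without replacement, so every selected row is distinct.
    pool = df.sample(total, random_state=random_state)
    # Consecutive slices of the shuffled pool become the individual samples.
    return [pool.iloc[i * sample_size:(i + 1) * sample_size]
            for i in range(n_samples)]

# Stand-in for the ~200,000-row dataframe from the question.
df = pd.DataFrame(np.random.randn(200_000, 5), columns=list("abcde"))
samples = disjoint_samples(df, n_samples=5, sample_size=1000, random_state=42)
```

Since all 5000 rows come from one draw without replacement, no row can appear twice within a sample or in two different samples, and there is no need to track which rows were already used. Note that "distinct" here means distinct rows of the dataframe; if the dataframe itself holds rows with identical values, those can still both be drawn.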

Tags:

python

pandas

GNMO11
