
Random Sample of a subset of a dataframe in Pandas

Say I have a dataframe with 100,000 entries and want to split it into 100 sections of 1,000 entries each.

How do I take a random sample of, say, size 50 from just one of the 100 sections? The dataset is already ordered so that the first 1,000 rows are the first section, the next 1,000 rows are the second section, and so on.

Many thanks

WGP asked Jun 28 '16 20:06


People also ask

How do you use Pandas to generate a random subset of rows of your dataset?

The easiest way to randomly select rows from a Pandas dataframe is to use the sample() method. For example, if your dataframe is called "df", df.sample(n=250) returns 250 randomly selected rows. Note that omitting the n parameter returns a single random row instead of multiple rows.


2 Answers

You can use the sample method*:

    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6

    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2

*Call it on the DataFrame for a single section, not on the full DataFrame.
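
To make the footnote concrete, here is a minimal sketch of slicing out one section by position and sampling from it. The frame, the `value` column, the section number `i`, and the `random_state` value are all made up for illustration; only `iloc` and `sample` come from the answer and the question's setup.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 100,000-row frame from the question.
df = pd.DataFrame({"value": np.arange(100_000)})

section_size = 1000
i = 3  # zero-based section number, chosen arbitrarily for this sketch

# Rows i*1000 through (i+1)*1000 - 1 make up section i.
section = df.iloc[i * section_size:(i + 1) * section_size]

# 50 distinct rows from that section; random_state only makes the draw repeatable.
sample = section.sample(n=50, random_state=0)
```

Because `sample` keeps the original index labels, the rows in `sample` can still be traced back to their positions in `df`.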

Note: If the sample size is larger than the DataFrame, this raises an error unless you sample with replacement.

    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'

    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4

Andy Hayden answered Sep 18 '22 20:09


One solution is to use the choice function from NumPy.

Say you want 50 entries out of the first 1000, you can use:

    import numpy as np

    # 50 distinct positions drawn from 0..999
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]

This of course does not take your block structure into account. If you want a 50-item sample from block i (counting from 0), for example, you can do:

    import numpy as np

    # First row position of block i (blocks are counted from 0)
    block_start_idx = 1000 * i
    # 50 distinct offsets within the block
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
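
The same idea can be wrapped in a small helper. This is only a sketch, not part of the answer above: the name `sample_block`, its parameters, and the `value` column are hypothetical, and it swaps in NumPy's newer `default_rng` generator in place of `np.random.choice` so that a seed can be passed.

```python
import numpy as np
import pandas as pd


def sample_block(df, i, block_size=1000, n=50, seed=None):
    """Return n distinct random rows from zero-based block i of df (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    start = block_size * i
    # Clip the end in case the last block is shorter than block_size.
    stop = min(start + block_size, len(df))
    if i < 0 or start >= stop:
        raise IndexError(f"block {i} is out of range")
    # Draw n distinct row positions inside the block; this raises ValueError
    # if n is larger than the number of rows in the block.
    positions = rng.choice(np.arange(start, stop), size=n, replace=False)
    return df.iloc[positions]


# Usage with a made-up 100,000-row frame.
df = pd.DataFrame({"value": np.arange(100_000)})
picked = sample_block(df, i=7, seed=1)
```

Passing `seed` makes the draw repeatable; leaving it as `None` gives a different sample on each call.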

jpjandrade answered Sep 18 '22 20:09