 

Sampling groups in Pandas

Tags: python, pandas

Say I want to do a stratified sample from a dataframe in Pandas so that I get 5% of rows for every value of a given column. How can I do that?

For example, in the dataframe below, I would like to sample 5% of the rows associated with each value of the column Z. Is there any way to sample groups from a dataframe loaded in memory?

> df 

   X   Y  Z
   1 123  a
   2  89  b
   1 234  a
   4 893  a
   6 234  b
   2 893  b
   3 200  c
   5 583  c
   2 583  c
   6 100  c
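
For the in-memory case, I can get close with a groupby and per-group sampling. Below is a rough sketch (the random_state is arbitrary, and I round the 5% up so these tiny example groups still return at least one row), but I'm not sure this is the idiomatic way or whether it scales:

import numpy as np
import pandas as pd

# The example frame from above
df = pd.DataFrame({'X': [1, 2, 1, 4, 6, 2, 3, 5, 2, 6],
                   'Y': [123, 89, 234, 893, 234, 893, 200, 583, 583, 100],
                   'Z': ['a', 'b', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c']})

frac = 0.05
# Sample ceil(5%) of each group so every value of Z is represented
sampled = (df.groupby('Z', group_keys=False)
             .apply(lambda g: g.sample(n=int(np.ceil(frac * len(g))), random_state=0)))
print(sampled)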

More generally, what if this dataframe is on disk in a huge file (e.g. an 8 GB CSV file)? Is there any way to do this sampling without having to load the entire dataframe into memory?

Amelio Vazquez-Reina asked Aug 08 '14 12:08

1 Answer

How about loading only the 'Z' column into memory using the 'usecols' option? Say the file is sample.csv. That should use much less memory if you have a bunch of columns. Then, assuming that fits into memory, I think this will work for you:

import numpy as np
import pandas as pd

stratfraction = 0.05

# Load only the Z column
df = pd.read_csv('sample.csv', usecols=['Z'])

# Generate the counts per value of Z
df['Obs'] = 1
gp = df.groupby('Z')

# Get the number of samples per group (rounded up so every group is represented)
df2 = np.ceil(gp.count() * stratfraction)

# Collect the indices of the requested sample (the first ceil(5%) entries of each group)
stratsample = []
for key in gp.groups:
    FirstFracEntries = gp.groups[key][0:int(df2['Obs'][key])]
    stratsample.extend(FirstFracEntries)
stratsample.sort()

# Build the set of rows to skip, since read_csv doesn't have a "rows to keep" option.
# Shift by 1 because line 0 of the file is the header, while the dataframe index
# starts at 0 on the first data row.
RowsToSkip = set(i + 1 for i in set(df.index.values).difference(stratsample))

# Load only the requested rows (no idea how well this works for a really giant list though)
df3 = pd.read_csv('sample.csv', skiprows=RowsToSkip)
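
As a quick sanity check (still assuming the file is sample.csv), the per-group sizes in df3 should line up with the ceil(5%) targets stored in df2:

# Compare the sampled group sizes against the ceil(5%) targets computed above
print(df3.groupby('Z').size())
print(df2['Obs'])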
BKay answered Sep 25 '22 22:09