Say I want to do a stratified sample from a dataframe in pandas, so that I get 5% of the rows for every value of a given column. How can I do that?
For example, in the dataframe below, I would like to sample 5% of the rows associated with each value of the column Z. Is there any way to sample groups from a dataframe loaded in memory?
> df
X    Y  Z
1  123  a
2   89  b
1  234  a
4  893  a
6  234  b
2  893  b
3  200  c
5  583  c
2  583  c
6  100  c
More generally, what if this dataframe is on disk in a huge file (e.g. an 8 GB CSV)? Is there any way to do this sampling without having to load the entire dataframe into memory?
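For the first part of the question, where the dataframe is already loaded in memory, a groupby-plus-sample approach should do it. Here is a minimal sketch (assuming df is the frame above; note that DataFrame.sample rounds frac * group size, so on tiny groups like the toy example 5% rounds down to zero rows, and this only pays off on larger groups):
import pandas as pd

# Take 5% of the rows within each Z group; group_keys=False keeps the original flat index
sampled = df.groupby('Z', group_keys=False).apply(lambda g: g.sample(frac=0.05))

# In pandas 1.1+, the groupby object has its own sample method, which is more direct:
# sampled = df.groupby('Z').sample(frac=0.05)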
For the huge file on disk, how about loading only the 'Z' column into memory using the 'usecols' option? Say the file is sample.csv. That should use much less memory if you have a bunch of columns. Then, assuming that one column fits into memory, I think something like this will work for you.
import numpy as np
import pandas as pd

stratfraction = 0.05
# Load only the Z column
df = pd.read_csv('sample.csv', usecols=['Z'])
# Generate the counts per value of Z
df['Obs'] = 1
gp = df.groupby('Z')
# Get the number of samples per group (rounded up)
df2 = np.ceil(gp.count() * stratfraction)
# Collect the indices of the requested sample (the first entries of each group)
stratsample = []
for key in gp.groups:
    FirstFracEntries = gp.groups[key][0:int(df2['Obs'][key])]
    stratsample.extend(FirstFracEntries)
# Generate the list of rows to skip, since read_csv doesn't have a "rows to keep" option;
# the +1 shifts dataframe indices to file line numbers (line 0 of the file is the header).
stratsample.sort()
RowsToSkip = [i + 1 for i in set(df.index).difference(stratsample)]
# Load only the requested rows (no idea how well this works for a really giant list though)
df3 = pd.read_csv('sample.csv', skiprows=RowsToSkip)
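As a quick sanity check (just a sketch reusing gp and df3 from above), you can compare the per-group sizes of the sample against the full counts; each ratio should come out close to stratfraction:
# Fraction of each Z group that made it into the sample (should be roughly 0.05)
print(df3.groupby('Z').size() / gp.size())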