
Sample a pandas DataFrame based on a dictionary

I'm trying to sample a pandas DataFrame based on a dictionary and a specific column. For each value of the y column, I know exactly how many observations I would like to pick.

I can do this via a groupby apply combo as such:

import pandas as pd

df = pd.DataFrame({'y': [2,2,0,0,0,1,1,1,1,1], 'x': 1, 'z': 2})

    y   x   z
0   2   1   2
1   2   1   2
2   0   1   2
3   0   1   2
4   0   1   2
5   1   1   2

sizes = {0: 2, 1: 1, 2: 1}

df.groupby('y').apply(lambda x: x.sample(sizes[x['y'].values[0]]))

     y  x  z
y           
0 2  0  1  2
  4  0  1  2
1 5  1  1  2
2 0  2  1  2

However, if I use unique instead of values (which should be equivalent), I get a weird KeyError: 'y' error on the DataFrame:

df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))

Can someone explain why this is happening?

EDIT:

This happened on 0.23.1 but not on 0.23.1 so this was probably a bug.

asked Jun 12 '26 12:06 by niczky12


1 Answer

I think you need the .name attribute, which holds the key of the current group:

df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.name]))
print(df1)

     y  x  z
y           
0 4  0  1  2
  2  0  1  2
1 6  1  1  2
2 0  2  1  2
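
To make the role of the group key more explicit, here is a minimal sketch that does the same sampling with a plain loop instead of apply. Iterating over a groupby yields (key, sub-frame) pairs, and that key is the value apply exposes as x.name. The loop, the variable names, and random_state=0 (added only for reproducibility) are my own assumptions, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({'y': [2, 2, 0, 0, 0, 1, 1, 1, 1, 1], 'x': 1, 'z': 2})
sizes = {0: 2, 1: 1, 2: 1}

# Each iteration gives the group key and the rows belonging to it.
# random_state=0 is an assumption, used only to make the draw repeatable.
parts = [
    group.sample(n=sizes[key], random_state=0)
    for key, group in df.groupby('y')
]
sampled = pd.concat(parts)

# Count how many rows were drawn for each value of y.
counts = sampled['y'].value_counts().to_dict()
print(counts)
```

The counts should match the sizes dictionary, which is a quick way to check that each group was sampled with its own size.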

If some values of y might be missing from the dictionary, use get with a default of 0 so that no rows are sampled for the unmatched values:

df1 = df.groupby('y').apply(lambda x: x.sample(sizes.get(x.name, 0)))
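
As a rough illustration of the get fallback, the sketch below uses a hypothetical dictionary, partial_sizes, that has no entry for y == 2. It counts the rows drawn per group with a plain loop rather than apply, which is my own simplification.

```python
import pandas as pd

df = pd.DataFrame({'y': [2, 2, 0, 0, 0, 1, 1, 1, 1, 1], 'x': 1, 'z': 2})

# Hypothetical dictionary with no entry for y == 2.
partial_sizes = {0: 2, 1: 1}

# dict.get returns 0 for the missing key, and sample(n=0) returns an empty frame.
drawn = {
    int(key): len(group.sample(n=partial_sizes.get(key, 0)))
    for key, group in df.groupby('y')
}
print(drawn)
```

With plain indexing (partial_sizes[key]) the group y == 2 would raise a KeyError instead of contributing zero rows.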

EDIT:

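The thing to watch with unique is that it returns a one-element NumPy array, so you have to take its first value with [0] before using it as a dictionary key: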

def f(x):
    print(x['y'].unique())                      # one-element array, e.g. [0]
    print(x['y'].unique()[0])                   # the scalar group value
    print(sizes[x['y'].unique()[0]])            # number of rows to sample
    print(x.sample(sizes[x['y'].unique()[0]]))  # the sampled rows

df1 = df.groupby('y').apply(f)

[0]
0
2
   y  x  z
2  0  1  2
4  0  1  2
[0]
0
2
   y  x  z
4  0  1  2
2  0  1  2
[1]
1
1
   y  x  z
6  1  1  2
[2]
2
1
   y  x  z
0  2  1  2

df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))
print(df1)

     y  x  z
y           
0 4  0  1  2
  2  0  1  2
1 6  1  1  2
2 0  2  1  2
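
The difference between the array and its first element can be checked in isolation. This is only a sketch on a hand-built single group (the group frame and the variable names are assumptions for illustration), not a reproduction of the KeyError from the question.

```python
import numpy as np
import pandas as pd

sizes = {0: 2, 1: 1, 2: 1}

# Hypothetical stand-in for one group as seen inside apply (all y == 0).
group = pd.DataFrame({'y': [0, 0, 0], 'x': 1, 'z': 2})

uniq = group['y'].unique()
print(type(uniq).__name__)  # the whole result is a NumPy array

# An array is not hashable, so it cannot be used as a dictionary key.
try:
    sizes[uniq]
    array_is_hashable = True
except TypeError:
    array_is_hashable = False

# The first element is a NumPy integer that hashes and compares like 0.
looked_up = sizes[uniq[0]]
print(array_is_hashable, looked_up)
```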

answered Jun 15 '26 02:06 by jezrael