I'm trying to sample a pandas DataFrame based on a dictionary and a specific column. So for each value of y column, I know exactly how many observations I would like to pick.
I can do this via a groupby apply combo as such:
import pandas as pd
df = pd.DataFrame({'y': [2,2,0,0,0,1,1,1,1,1], 'x': 1, 'z': 2})
y x z
0 2 1 2
1 2 1 2
2 0 1 2
3 0 1 2
4 0 1 2
5 1 1 2
sizes = {0: 2, 1: 1, 2:1}
df.groupby('y').apply(lambda x: x.sample(sizes[x['y'].values[0]]))
y x z
y
0 2 0 1 2
4 0 1 2
1 5 1 1 2
2 0 2 1 2
However, if I use unique instead of values (which should be equivalent), I get a strange KeyError: 'y' on the DataFrame:
df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))
Can someone explain why this is happening?
EDIT:
This happened on pandas 0.23.1 but not on another version, so it was probably a bug.
I think you need the .name attribute, which holds the group key inside apply:
df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.name]))
print (df1)
y x z
y
0 4 0 1 2
2 0 1 2
1 6 1 1 2
2 0 2 1 2
If some values of y might be missing from the dictionary, use get with a default of 0 so that nothing is sampled for unmatched values:
df1 = df.groupby('y').apply(lambda x: x.sample(sizes.get(x.name, 0)))
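If you would rather not depend on how apply passes the grouping column to the function (this seems to differ between pandas versions), a minimal sketch of the same idea is to loop over the groups and concatenate the samples. Note that sample_by_group and the random_state argument are names I made up for illustration; they are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'y': [2, 2, 0, 0, 0, 1, 1, 1, 1, 1], 'x': 1, 'z': 2})
sizes = {0: 2, 1: 1, 2: 1}


def sample_by_group(frame, column, sizes, random_state=None):
    """Hypothetical helper: sample sizes[key] rows from each group of `column`."""
    parts = [
        # sizes.get(key, 0) samples nothing for keys missing from the dict
        group.sample(n=sizes.get(key, 0), random_state=random_state)
        for key, group in frame.groupby(column)
    ]
    return pd.concat(parts)


sampled = sample_by_group(df, 'y', sizes, random_state=0)

# Number of sampled rows per value of y
counts = sampled['y'].value_counts().to_dict()
print(counts)
```

The group key comes straight from the groupby iteration here, so there is no need to read it back from the y column of each group.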
EDIT:
unique returns a one-element NumPy array, so you have to select its first value with [0]; for me this works:
def f(x):
print (x['y'].unique())
print (x['y'].unique()[0])
print (sizes[x['y'].unique()[0]])
print (x.sample(sizes[x['y'].unique()[0]]))
df1 = df.groupby('y').apply(f)
[0]
0
2
y x z
2 0 1 2
4 0 1 2
[0]
0
2
y x z
4 0 1 2
2 0 1 2
[1]
1
1
y x z
6 1 1 2
[2]
2
1
y x z
0 2 1 2
df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))
print (df1)
y x z
y
0 4 0 1 2
2 0 1 2
1 6 1 1 2
2 0 2 1 2
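To check the types outside of groupby, here is a small sketch on a hand-built frame that stands in for one group (group is my own stand-in, not something pandas passes around under that name). As far as I can tell, the NumPy integer returned by unique()[0] hashes like a plain int, so the dictionary lookup itself should not be what raises the KeyError.

```python
import numpy as np
import pandas as pd

sizes = {0: 2, 1: 1, 2: 1}

# Stand-in for the sub-frame of the group where y == 0
group = pd.DataFrame({'y': [0, 0, 0], 'x': 1, 'z': 2})

uniques = group['y'].unique()   # one-element NumPy array
first = uniques[0]              # NumPy integer scalar, not a Python int
looked_up = sizes[first]        # lookup works: hashes and compares equal to int 0

print(type(uniques), type(first), looked_up)
```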