Note: this question is not the same as the one answered here: "Pandas: sample each group after groupby"
I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

    df['class'].value_counts()

    c1    9170
    c2    5266
    c3    4523
    c4    2193
    c5    1956
    c6    1896
    c7    1580
    c8    1407
    c9    1324
I need a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the class with the fewest instances. In this example the sample size should be the size of class c9 = 1324.
Is there a simple way to do this with pandas?
Update
To clarify my question, in the table above:

    c1    9170
    c2    5266
    c3    4523
    ...

the numbers are counts of instances of the c1, c2, c3, ... classes, so the actual data looks like this:

    c1    'foo'
    c2    'bar'
    c1    'foo-2'
    c1    'foo-145'
    c1    'xxx-07'
    c2    'zzz'
    ...

etc.
Update 2
To clarify more:
    d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
         'val': [1,2,1,1,2,1,1,2,3,3]
        }
    df = pd.DataFrame(d)

      class  val
    0    c1    1
    1    c2    2
    2    c1    1
    3    c1    1
    4    c2    2
    5    c1    1
    6    c1    1
    7    c2    2
    8    c3    3
    9    c3    3

    df['class'].value_counts()

    c1    5
    c2    3
    c3    2
    Name: class, dtype: int64

    g = df.groupby('class')
    g.apply(lambda x: x.sample(g.size().min()))

            class  val
    class
    c1    6    c1    1
          5    c1    1
    c2    4    c2    2
          1    c2    2
    c3    9    c3    3
          8    c3    3
Looks like this works. Main questions:
How does g.apply(lambda x: x.sample(g.size().min())) work? I know what lambda is, but:
What is passed to lambda in x in this case?
What is g in g.size()?

In simple words, you need to check whether the classes present in your target variable are imbalanced. If you check the ratio between DEATH_EVENT=1 and DEATH_EVENT=0, it is 2:1, which means the dataset is imbalanced. To balance it, we can either oversample or undersample the data.
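The two options named above, undersampling and oversampling, can be sketched with pandas alone. This is only a minimal illustration, assuming the small class/val frame from Update 2 and pandas 1.1 or later (where groupby objects have a sample method). The names under and over and the fixed random_state are choices made here for reproducibility, not part of the original post.

```python
import pandas as pd

# Small example frame from "Update 2" (c1: 5 rows, c2: 3 rows, c3: 2 rows)
df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val':   [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

sizes = df.groupby('class').size()

# Undersample: draw as many rows from every class as the smallest class has
under = df.groupby('class').sample(n=sizes.min(), random_state=0)

# Oversample: draw with replacement up to the size of the largest class
over = df.groupby('class').sample(n=sizes.max(), replace=True, random_state=0)

under_counts = under['class'].value_counts()
over_counts = over['class'].value_counts()
```

Undersampling discards rows from the larger classes, while oversampling repeats rows from the smaller ones; which is preferable depends on how much data can be spared.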
Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data. The main data structures in Pandas are the Series and DataFrame classes.
    g = df.groupby('class')
    g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

      class  val
    0    c1    1
    1    c1    1
    2    c2    2
    3    c2    2
    4    c3    3
    5    c3    3
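Where reset_index(drop=True) is called changes the shape of the result, so here is a small sketch comparing the two placements. It assumes the example frame from Update 2; the names inside, outside and n_min are illustrative only. Recent pandas releases may warn about apply operating on the grouping column, and how that column is handled can differ between versions.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val':   [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby('class')
n_min = g.size().min()  # size of the smallest class

# reset_index inside the lambda: each group's piece is renumbered from 0,
# but apply still adds the group key as an outer index level
inside = g.apply(lambda x: x.sample(n_min).reset_index(drop=True))

# reset_index on the combined result: one flat 0..N-1 index
outside = g.apply(lambda x: x.sample(n_min)).reset_index(drop=True)
```

Only the second form gives a single flat integer index over the whole balanced frame; the first keeps a two-level index of class and position within the class.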
Answers to your follow-up questions

x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.

g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts... saves time.

The numbers 6, 5, 4, 1, 9, 8 in your output are the index labels from df that go with the sampling. I added reset_index(drop=True) to get rid of them.
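To make the roles of x and g concrete, the same sampling can be written as an explicit loop. This is a sketch under assumptions rather than what apply does internally: it uses the example frame from Update 2, and the names pieces, result and flat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val':   [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby('class')   # the groupby object, reused below
sizes = g.size()          # per-class row counts, like value_counts()
n_min = sizes.min()

# Iterating over g yields (key, sub-DataFrame) pairs; each sub-DataFrame
# plays the role of x in the lambda
pieces = []
for key, x in g:
    pieces.append(x.sample(n_min))  # sampled rows keep their labels from df

result = pd.concat(pieces)            # index holds original labels from df
flat = result.reset_index(drop=True)  # labels replaced by 0..N-1
```

Printing result.index shows which rows of df were drawn, which is where numbers such as 6, 5, 4, 1, 9, 8 in the question's output come from.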