
Pandas : balancing data

Tags:

python

pandas

Note: This question is not the same as an answer here: "Pandas: sample each group after groupby"

Trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

    df['class'].value_counts()

    c1    9170
    c2    5266
    c3    4523
    c4    2193
    c5    1956
    c6    1896
    c7    1580
    c8    1407
    c9    1324

I need to get a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the class with the fewest instances. In this example the sample size should be the size of class c9, i.e. 1324.

Any simple way to do this with Pandas?

Update

To clarify my question, in the table above:

    c1    9170
    c2    5266
    c3    4523
    ...

The numbers are counts of instances of the classes c1, c2, c3, ..., so the actual data looks like this:

    c1 'foo'
    c2 'bar'
    c1 'foo-2'
    c1 'foo-145'
    c1 'xxx-07'
    c2 'zzz'
    ...

etc.

Update 2

To clarify more:

    import pandas as pd

    d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
         'val': [1,2,1,1,2,1,1,2,3,3]
        }

    df = pd.DataFrame(d)

      class  val
    0    c1    1
    1    c2    2
    2    c1    1
    3    c1    1
    4    c2    2
    5    c1    1
    6    c1    1
    7    c2    2
    8    c3    3
    9    c3    3

    df['class'].value_counts()

    c1    5
    c2    3
    c3    2
    Name: class, dtype: int64

    g = df.groupby('class')
    g.apply(lambda x: x.sample(g.size().min()))

            class  val
    class
    c1    6    c1    1
          5    c1    1
    c2    4    c2    2
          1    c2    2
    c3    9    c3    3
          8    c3    3

Looks like this works. Main questions:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:

  • What is passed to lambda in x in this case?
  • What is g in g.size()?
  • Why does the output contain the numbers 6, 5, 4, 1, 9, 8? What do they mean?
dokondr asked Aug 23 '17 12:08




1 Answer

    g = df.groupby('class')
    g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

      class  val
    0    c1    1
    1    c1    1
    2    c2    2
    3    c2    2
    4    c3    3
    5    c3    3
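As an alternative to the apply-based line above, here is a minimal sketch that assumes a pandas version with DataFrameGroupBy.sample (added in pandas 1.1, so it is not part of the original answer). The names n_min and balanced, and the random_state=0 seed, are only illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val':   [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Size of the smallest class (c3 has 2 rows in this toy frame).
n_min = int(df['class'].value_counts().min())

# Draw n_min rows from every class, then discard the original row labels.
# random_state is an assumption added here only to make the draw repeatable.
balanced = (
    df.groupby('class')
      .sample(n=n_min, random_state=0)
      .reset_index(drop=True)
)

print(balanced['class'].value_counts())
```

Every class should end up with the same number of rows. Drop random_state if you want a different sample on each run.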

Answers to your follow-up questions

  1. The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
  2. g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), and since I was going to group by 'class' anyway, I reused the same groupby object and called size() on it to get the counts. That saves time.
  3. Those numbers are the index values from df for the rows that were sampled. I added reset_index(drop=True) to get rid of them.
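To make the three points above concrete, here is a small sketch that looks at the pieces one at a time. It iterates over the groupby instead of calling apply, purely for inspection. The names sizes, group_index, picked_index and the random_state=0 seed are illustrative assumptions, not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val':   [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby('class')

# Point 2: g.size() is a Series with the number of rows in each class.
sizes = g.size()
n_min = int(sizes.min())
print(sizes)

group_index = {}
picked_index = {}

# Point 1: iterating yields (key, sub-DataFrame) pairs; each sub-DataFrame
# is the kind of object that apply hands to the lambda as x.
for key, x in g:
    group_index[key] = list(x.index)
    # Point 3: the sampled rows keep their original row labels from df.
    picked = x.sample(n_min, random_state=0)
    picked_index[key] = list(picked.index)
    print(key, group_index[key], '->', picked_index[key])
```

The labels printed on the right of each arrow are always a subset of that class's labels on the left, which is where numbers like 6, 5, 4, 1, 9, 8 in the question's output come from.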
piRSquared answered Sep 28 '22 13:09
