Pandas : balancing data

Tags: python, pandas

Note: This question is not the same as the one answered here: "Pandas: sample each group after groupby"

I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

    df['class'].value_counts()

    c1    9170
    c2    5266
    c3    4523
    c4    2193
    c5    1956
    c6    1896
    c7    1580
    c8    1407
    c9    1324

I need a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the class with the fewest instances. In this example, the sample size should be the size of class c9, which is 1324.

Is there a simple way to do this with pandas?

Update

To clarify my question, in the table above:

    c1    9170
    c2    5266
    c3    4523
    ...

The numbers are the instance counts of classes c1, c2, c3, ..., so the actual data looks like this:

    c1 'foo'
    c2 'bar'
    c1 'foo-2'
    c1 'foo-145'
    c1 'xxx-07'
    c2 'zzz'
    ...

etc.

Update 2

To clarify more:

    import pandas as pd

    d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
         'val': [1,2,1,1,2,1,1,2,3,3]}

    df = pd.DataFrame(d)

      class  val
    0    c1    1
    1    c2    2
    2    c1    1
    3    c1    1
    4    c2    2
    5    c1    1
    6    c1    1
    7    c2    2
    8    c3    3
    9    c3    3

    df['class'].value_counts()

    c1    5
    c2    3
    c3    2
    Name: class, dtype: int64

    g = df.groupby('class')
    g.apply(lambda x: x.sample(g.size().min()))

            class  val
    class
    c1    6    c1    1
          5    c1    1
    c2    4    c2    2
          1    c2    2
    c3    9    c3    3
          8    c3    3

Looks like this works. Main questions:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:

- What is passed to the lambda as x in this case?
- What is g in g.size()?
- Why does the output contain the numbers 6, 5, 4, 1, 9, 8? What do they mean?

asked Aug 23 '17 12:08

dokondr

1 Answer

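    g = df.groupby('class')
    # reset_index is applied to the combined result of apply,
    # so the final frame gets a flat 0..5 index
    g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

      class  val
    0    c1    1
    1    c1    1
    2    c2    2
    3    c2    2
    4    c3    3
    5    c3    3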
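As a possible alternative to apply with a lambda, here is a minimal sketch that uses the groupby object's own sample method. It assumes pandas 1.1 or later, where that method is available. The names n_min and balanced and the seed passed to random_state are illustrative choices, not part of the original question or answer.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Size of the smallest class; every class is downsampled to this many rows.
n_min = int(df["class"].value_counts().min())

# Draw n_min rows from each group (assumes pandas >= 1.1).
# random_state is an arbitrary seed so the draw is repeatable.
balanced = df.groupby("class").sample(n=n_min, random_state=0)

# Sampled rows keep their original index labels; replace them with 0..n-1.
balanced = balanced.reset_index(drop=True)

print(balanced["class"].value_counts())
```

Each class should end up with the same number of rows as the smallest class (2 in this toy data). Dropping random_state gives a different draw on each run.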

Answers to your follow-up questions

1. The x in the lambda is a dataframe: the subset of df that belongs to one group. Each of these dataframes, one per group, gets passed through the lambda.
2. g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and call size on it to get the value counts, which saves time.
3. Those numbers are the index values from df that belong to the sampled rows. I added reset_index(drop=True) to get rid of them.
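The three points above can be checked directly on the toy data from the question. This is only an illustrative sketch: the variable names sizes, counts, and sampled and the random_state seed are assumptions added here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})
g = df.groupby("class")

# 1. Iterating over the groupby yields (key, sub-DataFrame) pairs;
#    each sub-DataFrame is what the lambda receives as x.
for key, x in g:
    print(key, type(x).__name__, list(x.index))
# c1 DataFrame [0, 2, 3, 5, 6]
# c2 DataFrame [1, 4, 7]
# c3 DataFrame [8, 9]

# 2. g.size() gives the same per-class counts as value_counts().
sizes = g.size()
counts = df["class"].value_counts()
print(sizes.to_dict() == counts.to_dict())  # True

# 3. Rows sampled from one group keep their index labels from df.
c1_rows = df[df["class"] == "c1"]
sampled = c1_rows.sample(int(sizes.min()), random_state=0)
print(list(sampled.index))  # two labels drawn from [0, 2, 3, 5, 6]
```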

answered Sep 28 '22 13:09

piRSquared
