Sampling a dataframe based on a given distribution

Tags:

python

pandas

graphlab

sframe

How can I sample a pandas DataFrame or GraphLab SFrame according to a given class-label distribution? For example, I want to sample a data frame that has a label/class column so that each class label is selected equally often, giving a roughly uniform distribution of class labels. Better still would be to sample according to whatever class distribution we specify.

+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+

Rows should be extracted from the first dataframe according to the frequency distribution in the second, where the nostoextract column gives the required count for each class. The result is a sampled frame in which each class appears at most 2 times. If a class does not have enough rows to meet the required count, the sampling should ignore that and continue. The resulting dataframe is to be used for a decision-tree-based classifier.

As a commenter puts it: the sampled dataframe has to contain nostoextract different instances of the corresponding class, unless there are not enough examples for a given class, in which case you just take all the available ones.


asked Oct 13 '15 07:10

stackit

2 Answers

I think this will solve your problem:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'clol2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))

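Prints (the rows picked and their order within each class vary between runs, because the sampling is random):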

  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5

If you don't want the result to be ordered by class, you can permute it at the end.
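One way to do that permutation, as a minimal sketch: `DataFrame.sample(frac=1)` returns every row exactly once in random order. The `samples` frame below is only a hand-written stand-in for the output of `bootstrap`, and `random_state=0` is set purely to make the run repeatable; both are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the class-ordered frame returned by bootstrap(data, freq).
samples = pd.DataFrame({'cols1': [4, 321, 5, 32, 4, 5],
                        'clol2': [45, 1, 66, 432, 6, 6],
                        'class': ['A', 'A', 'B', 'B', 'C', 'C']},
                       index=[0, 4, 1, 5, 3, 2])

# frac=1 draws all rows without replacement, i.e. a random permutation.
shuffled = samples.sample(frac=1, random_state=0)

# Optional: drop the original row labels and use a fresh 0..n-1 index.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled)
```

Drop `random_state` if you want a different order on every run.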


answered Oct 02 '22 18:10

swenzel

Can you split your first dataframe into class-specific sub-dataframes, and then sample at will from those?

For example:

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....

Once you have filtered out dfa, dfb and dfc, take as many rows as you want from the top (if the dataframes have no particular sort order):

dfasamplefive = dfa[:5]

Or use the sample method, as described by a previous commenter, to take a random sample directly:

dfasamplefive = dfa.sample(n=5)

If that suits your needs, all that's left is to automate the process, reading the number of rows to sample for each class from your second (control) dataframe.
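A rough sketch of that automation, assuming the column names from the question (`class`, `nostoextract`): loop over the control dataframe, filter one class at a time, and cap the request at the number of rows available. The helper name `sample_per_class` and the `random_state` parameter are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'clol2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

control = pd.DataFrame({'class': ['A', 'B', 'C'],
                        'nostoextract': [2, 2, 2]})


def sample_per_class(df, control, random_state=None):
    """Hypothetical helper: sample up to `nostoextract` rows per class."""
    parts = []
    for cls, n_wanted in zip(control['class'], control['nostoextract']):
        subset = df[df['class'] == cls]
        # Take everything if the class has fewer rows than requested.
        n = min(len(subset), int(n_wanted))
        if n == 0:
            continue  # class absent from df, or zero rows requested
        parts.append(subset.sample(n=n, random_state=random_state))
    if not parts:
        return df.iloc[0:0]  # empty frame with the same columns
    return pd.concat(parts)


sampled = sample_per_class(df, control, random_state=0)
print(sampled)
```

The result keeps the original row labels from `df`, so each sampled row can be traced back to its source.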

answered Oct 02 '22 18:10

Thomas Kimber
