How can I sample equally from a dataframe?

Tags: python, pandas

Suppose I have some observations, each labelled with a class from 1 to n. The classes do not necessarily occur equally often in the data set.

How can I sample equally from each class in the dataframe? Right now I do something like...

import pandas as pd

frames = []
classes = df.classes.unique()

for i in classes:
    # select the rows of class i (== compares; a single = is a syntax error here)
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)

equally_sampled = pd.concat(frames)

Is there a pandas function that samples equally from each class?

585

asked Nov 17 '16 02:11

Demetri Pananos

1 Answer

A more concise way is to group by class and sample within each group:

df.groupby('classes').apply(lambda x: x.sample(sample_size))
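
As a self-contained sketch of the same idea, assuming pandas 1.1 or later (where groupby objects have their own sample method) and a made-up toy DataFrame; the data, the value column and the sample_size of 10 are illustrative only:

```python
import pandas as pd

# Hypothetical toy data: three classes with unequal frequencies.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})
sample_size = 10  # rows to draw from each class (illustrative value)

# Draw sample_size rows from every class; random_state makes the draw repeatable.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

# Each class now contributes the same number of rows.
print(equally_sampled["classes"].value_counts().sort_index())
```

Note that sampling without replacement raises a ValueError if any class has fewer than sample_size rows; passing replace=True is one way around that, at the cost of possible duplicate rows.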

Extension:

You can make the sample_size a function of group size, so that each class is sampled in proportion to its size and every row has the same probability of being selected:

nrows = len(df)
total_sample_size = 10_000
# len(x) is the group's row count; x.count() would return one count per column
df.groupby('classes').\
    apply(lambda x: x.sample(int(len(x) / nrows * total_sample_size)))

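Because each group's sample size is truncated to an integer, the result may contain slightly fewer rows than total_sample_size, but the class proportions will stay close to those of the original data, unlike with a fixed per-class sample size.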
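
An alternative sketch of proportional sampling, not the method used above: pass the frac argument so pandas computes each group's size itself. This assumes pandas 1.1 or later; the toy data and the total_sample_size of 100 are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: class sizes 500, 300 and 200.
df = pd.DataFrame({
    "classes": [1] * 500 + [2] * 300 + [3] * 200,
    "value": range(1000),
})
total_sample_size = 100  # desired overall sample size (illustrative value)

# The same fraction is taken from every class, so class proportions are preserved.
frac = total_sample_size / len(df)
proportional = df.groupby("classes").sample(frac=frac, random_state=0)

# Per-class row counts in the sample.
print(proportional["classes"].value_counts().sort_index())
```

Here pandas rounds each group's size rather than truncating it, so the total can still differ slightly from total_sample_size when the class sizes do not divide evenly.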

135

answered Oct 11 '22 15:10

Kartik
