Suppose I have some observations, each with an indicated class from 1 to n. These classes do not necessarily occur equally often in the data set.
How can I equally sample from the dataframe? Right now I do something like...
import pandas as pd

frames = []
classes = df.classes.unique()
for i in classes:
    # Select the rows of class i and draw sample_size of them at random
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)
equally_sampled = pd.concat(frames)
Is there a pandas function to equally sample?
Given a DataFrame with N rows, random sampling extracts X random rows from it, with X ≤ N. pandas provides the sample() method for this. The number of rows to extract can be given in two alternative ways: as an exact number of rows (n, an int) or as a fraction of the rows (frac, a float, which returns about frac * len(df) rows). Depending on the axis argument, sample() draws random rows or columns from the calling DataFrame.
pandas.DataFrame.sample: DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None). Return a random sample of items from an axis of object. You can use random_state for reproducibility.
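To make the n, frac and random_state parameters concrete, here is a minimal sketch. The toy DataFrame and its value column are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"value": range(10)})

# Exact number of rows: 3 rows drawn without replacement.
by_count = df.sample(n=3, random_state=0)

# Fraction of rows: 50% of 10 rows gives 5 rows.
by_fraction = df.sample(frac=0.5, random_state=0)

# Reusing the same random_state reproduces the same rows.
repeat = df.sample(n=3, random_state=0)

print(len(by_count), len(by_fraction), by_count.equals(repeat))  # 3 5 True
```

Passing both n and frac in the same call raises an error, so pick one of the two.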
NOTE: If you want to keep a representative dataset and your only problem is its size, I would suggest taking a stratified sample instead. A stratified sample ensures that the distribution of a column is the same before and after sampling. With the above, you will get two dataframes. The first will be 20% of the whole dataset.
Return one random sample row of the DataFrame. In this example we use a .csv file called data.csv. The sample() method returns a specified number of random rows, or 1 row if no number is specified. Note: the column names are returned along with the sampled rows.
For more elegance you can do this:
df.groupby('classes').apply(lambda x: x.sample(sample_size))
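If your pandas is 1.1 or newer, the groupby object has its own sample() method, which should give the same per-class sampling without apply. This is only a sketch: the data below is invented, and sample_size is assumed to be no larger than the smallest class.

```python
import pandas as pd

# Hypothetical imbalanced data: class 1 is far more common than 2 or 3.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 20 + [3] * 10,
    "value": range(80),
})
sample_size = 5  # must not exceed the smallest class unless replace=True

# Draw the same number of rows from every class.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

# Each class now contributes exactly sample_size rows.
print(equally_sampled["classes"].value_counts().sort_index())
```

Unlike the apply version, the result keeps the original row index rather than adding the class as an extra index level.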
You can make sample_size a function of group size to sample with equal probabilities (or proportionately):
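nrows = len(df)
total_sample_size = 10_000  # should not exceed nrows when sampling without replacement
# len(x) is the group's row count; x.count() would return one count per column
df.groupby('classes').apply(
    lambda x: x.sample(int(len(x) / nrows * total_sample_size))
)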
This won't give exactly total_sample_size rows, because each group's size is truncated to an integer, but the sampling will be more proportional than the naive method.
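A shorter route to a proportional (stratified) sample, assuming pandas 1.1 or newer, is to pass frac to the groupby sample() method so that every class is reduced by the same fraction. The data and the 10% fraction below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data with class shares of 50%, 30% and 20%.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})

# Take 10% of every class, so the class shares are preserved.
stratified = df.groupby("classes").sample(frac=0.1, random_state=0)

# Class counts in the sample follow the original proportions.
print(stratified["classes"].value_counts().sort_index())
```

Per-group sizes are still rounded to whole rows, so with small or uneven groups the total can differ slightly from frac * len(df).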