
How can I sample equally from a dataframe?

Tags: python, pandas

Suppose I have some observations, each labelled with a class from 1 to n. The classes do not necessarily occur equally often in the data set.

How can I sample the same number of rows from each class of the dataframe? Right now I do something like this:

frames = []
classes = df.classes.unique()

# draw sample_size rows from each class, then stack the pieces
for i in classes:
    g = df[df.classes == i].sample(sample_size)
    frames.append(g)

equally_sampled = pd.concat(frames)

Is there a pandas function to equally sample?

Demetri Pananos asked Nov 17 '16 02:11



1 Answer

More elegantly, you can group by class and sample within each group:

df.groupby('classes').apply(lambda x: x.sample(sample_size))
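
As a self-contained sketch of the same idea: since pandas 1.1 there is also `DataFrameGroupBy.sample`, which samples within each group without going through `apply`. The column name `classes` comes from the question; the toy data, the `value` column and `sample_size = 2` are invented here purely for illustration.

```python
import pandas as pd

# Toy, unbalanced data (invented for illustration):
# class 1 has 5 rows, class 2 has 3 rows, class 3 has 2 rows.
df = pd.DataFrame({
    "classes": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
    "value": range(10),
})

sample_size = 2  # hypothetical; must not exceed the smallest class unless replace=True

# Draw sample_size rows from every class; random_state makes the draw repeatable.
equally_sampled = df.groupby("classes").sample(n=sample_size, random_state=0)

# Rows per class in the sample.
counts = equally_sampled["classes"].value_counts().sort_index()
print(counts)
```

If any class has fewer than `sample_size` rows, sampling without replacement raises a `ValueError`. One option is to pass `replace=True`; another is to balance down to the smallest class with `sample_size = df["classes"].value_counts().min()`.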

Extension:

You can also make the sample size a function of the group size, so that each class is sampled in proportion to its share of the data:

nrows = len(df)
total_sample_size = 10_000
# len(x) is the number of rows in the group (x.count() would give per-column counts)
df.groupby('classes').apply(
    lambda x: x.sample(int(len(x) / nrows * total_sample_size)))

Because each group's sample size is truncated to an integer, the result will not contain exactly total_sample_size rows, but the sampling will be more proportional than with the naive method.
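
The proportional variant can also be sketched with the `frac` argument of `DataFrameGroupBy.sample` (pandas 1.1 and later), which takes the same fraction of every group. The data below and the target of 10 rows are made up so that the arithmetic is easy to follow; this is one possible approach, not necessarily what the answer's author had in mind.

```python
import pandas as pd

# Toy data (invented for illustration): 50, 30 and 20 rows in classes 1, 2 and 3.
df = pd.DataFrame({
    "classes": [1] * 50 + [2] * 30 + [3] * 20,
    "value": range(100),
})

total_sample_size = 10                # hypothetical target size
frac = total_sample_size / len(df)    # 0.1: the same fraction is taken from every class

# Each class contributes round(frac * group_size) rows.
proportional = df.groupby("classes").sample(frac=frac, random_state=0)

# Here classes 1, 2 and 3 contribute 5, 3 and 2 rows respectively.
counts = proportional["classes"].value_counts().sort_index()
print(counts)
```

With group sizes that do not divide evenly, the per-group rounding means the total can differ from `total_sample_size` by a few rows.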

Kartik answered Oct 11 '22 15:10