
Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in its target column, which is binary (0/1).

I want to extract an equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function, but I'm not sure how to request an equal number of samples from both classes based on the target column.

I was thinking of using something like this:

df.sample(n=10000, weights='target', random_state=1)

Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!

mlenthusiast asked May 17 '19 18:05





1 Answer

You can use the DataFrameGroupBy.sample method as follows:

sample_df = df.groupby("target").sample(n=5000, random_state=1)
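
For anyone who wants to try this end to end, here is a minimal sketch on made-up data. The toy dataframe, the "feature" column, and the n_per_class value are assumptions for illustration only; the question would use n=5000. As far as I know, DataFrameGroupBy.sample needs pandas 1.1 or later, so the sketch also shows a concat-based fallback that should work on older versions. Note that weights='target' in the original attempt treats the column values as sampling weights, so rows with target 0 would get zero probability and never be drawn.

```python
import pandas as pd

# Hypothetical imbalanced data: 700 rows with target 0, 300 with target 1.
df = pd.DataFrame({
    "feature": range(1000),
    "target": [0] * 700 + [1] * 300,
})

n_per_class = 100  # illustrative; the question asks for 5000 per class

# Draw the same number of rows from each target class (pandas >= 1.1).
sample_df = df.groupby("target").sample(n=n_per_class, random_state=1)

# Fallback that avoids DataFrameGroupBy.sample: sample each group, then concatenate.
sample_df_fallback = pd.concat(
    [group.sample(n=n_per_class, random_state=1) for _, group in df.groupby("target")]
)

# Count rows per class to confirm the sample is balanced.
counts = sample_df["target"].value_counts().to_dict()
fallback_counts = sample_df_fallback["target"].value_counts().to_dict()
print(counts)
```

If a class has fewer rows than n, sample raises a ValueError unless you pass replace=True. The result comes back grouped by class, so you may want to shuffle it afterwards, for example with sample_df.sample(frac=1, random_state=1).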
Ahmad answered Oct 29 '22 06:10
