python: taking random sample from data but keeping the same distribution

Question

I have a training set of more than 20,000 instances, split into 3 classes with a distribution like A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while preserving the class distribution? I need to run a grid search, but the original dataset is too high-dimensional (20,000 instances x 12,000 features).

train_test_split will keep the distribution, but it only splits the entire dataset into two sets, which are still too large.

Thanks

shivsn · Accepted Answer

You should use StratifiedKFold. Its folds are built so that each one preserves the percentage of samples in each class. See the documentation for how to use it.
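As a rough sketch of how this could give the 10% sample asked for: with n_splits=10, each test fold holds about a tenth of the rows, so taking the indices of the first fold yields a stratified 10% subset. The data below is synthetic (2,000 rows, 5 features) and the names X_sample, y_sample and sample_dist are only illustrative, not from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data: classes A/B/C at 10% / 20% / 70%.
rng = np.random.default_rng(0)
y = np.array(["A"] * 200 + ["B"] * 400 + ["C"] * 1400)
X = rng.normal(size=(len(y), 5))

# With 10 folds, each test fold contains roughly 10% of the rows,
# and the class proportions in it follow those of the full dataset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
_, sample_idx = next(skf.split(X, y))

X_sample, y_sample = X[sample_idx], y[sample_idx]

# Check the class proportions of the sample.
labels, counts = np.unique(y_sample, return_counts=True)
sample_dist = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
print(len(y_sample), sample_dist)
```

The grid search can then be run on X_sample and y_sample instead of the full dataset.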

John Damen · Answer

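The train_test_split function lets you set the size of each split. With test_size=0.1 the test split holds 10% of the data, and passing stratify=y keeps the class proportions of y in both splits:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)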

See the docs
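For a self-contained version of this idea, here is a minimal sketch on synthetic data (2,000 rows, 5 features, classes at 10/20/70%). It assumes the 10% test split is the part you keep as the sample; the names X_sample, y_sample and sample_dist are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: classes A/B/C at 10% / 20% / 70%.
rng = np.random.default_rng(0)
y = np.array(["A"] * 200 + ["B"] * 400 + ["C"] * 1400)
X = rng.normal(size=(len(y), 5))

# test_size=0.1 puts 10% of the rows in the second split;
# stratify=y makes both splits follow the class proportions of y.
X_rest, X_sample, y_rest, y_sample = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Check the class proportions of the 10% sample.
labels, counts = np.unique(y_sample, return_counts=True)
sample_dist = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
print(len(y_sample), sample_dist)
```

Without stratify=y the split is purely random, so the class proportions in a small sample are only approximately preserved.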

Tags: python, pandas, scikit-learn

Ziqi
