
python: taking random sample from data but keeping the same distribution

I have training data with more than 20,000 instances split into 3 classes, with a distribution like A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while preserving the class distribution? I need to run a grid search on the data, but the original dataset is too high-dimensional (20,000 instances x 12,000 features).

The train_test_split will keep the distribution but it only splits the entire dataset into two sets, which are still too large.

Thanks

Ziqi asked Mar 07 '26 13:03



2 Answers

You should use StratifiedKFold. The folds are made by preserving the percentage of samples for each class. See the documentation for details on using it.
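
As a rough sketch of how this could be used for sampling (not necessarily what the answer had in mind): with n_splits=10, the test indices of any single fold make up about 10% of the data with the class proportions preserved. The toy X and y below are made-up stand-ins for the real data, and the names sample_idx, X_sample and y_sample are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the real data (assumption): 1,000 rows, classes at 10% / 20% / 70%.
rng = np.random.default_rng(0)
y = np.array(["A"] * 100 + ["B"] * 200 + ["C"] * 700)
X = rng.normal(size=(len(y), 5))

# Ten stratified folds: each fold's test part holds roughly 10% of the rows
# with the same class proportions as the full dataset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Take only the first fold and keep its test indices as the sample.
_, sample_idx = next(skf.split(X, y))
X_sample, y_sample = X[sample_idx], y[sample_idx]

# Class counts in the sample, for a quick sanity check.
labels, counts = np.unique(y_sample, return_counts=True)
print(dict(zip(labels, counts)))
```

The grid search could then be run on X_sample and y_sample instead of the full arrays.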

shivsn answered Mar 09 '26 03:03



The train_test_split function lets you set the size of each split. With test_size=0.1, X_test and y_test hold 10% of the data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

See the docs
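
Note that train_test_split only preserves the class proportions when the stratify argument is passed. A minimal sketch of drawing a 10% stratified sample this way, assuming toy X and y in place of the real data (X_small and y_small are just illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (assumption): 1,000 rows, classes at 10% / 20% / 70%.
y = np.array(["A"] * 100 + ["B"] * 200 + ["C"] * 700)
X = np.arange(len(y) * 3).reshape(len(y), 3)

# Keep 10% as the small set; stratify=y keeps the class proportions of y.
# The remaining 90% is discarded here.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

# Class counts in the sample, for a quick sanity check.
labels, counts = np.unique(y_small, return_counts=True)
print(dict(zip(labels, counts)))
```

The small set can then be split again, or passed to a grid search with stratified cross-validation.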

John Damen answered Mar 09 '26 02:03



