When to use train_test_split of scikit-learn

I have a dataset with 19 features. I need to impute missing values, encode the categorical variables using scikit-learn's OneHotEncoder, and then run a machine learning algorithm.

My question is: should I do all of the above on the whole dataset and only then split it with scikit-learn's train_test_split, or should I first split into train and test sets and then do the missing value imputation and encoding on each set?

My concern is that if I split first and then do imputation and encoding on the two resulting sets, the test set might be missing some values of a categorical variable, resulting in fewer dummies. For example, if the original data had three levels for a categorical variable, then even though we are sampling randomly, is there a chance that the test set does not contain all three levels, so that it ends up with only two dummies instead of three?

What's the right approach: splitting first and then doing all of the above on train and test, or doing missing value imputation and encoding on the whole dataset first and then splitting?

asked May 04 '15 23:05

Baktaawar

1 Answer

I would first split the data into a training and testing set. Your missing value imputation strategy should be fitted on the training data and applied to both the training and testing data.

For instance, you might intend to replace missing values with the most frequent value or the median. This knowledge (median, most frequent value) must be obtained without having seen the testing set. Otherwise, your missing value imputation will be biased. If some values of a feature are unseen in the training data, then you can, for instance, increase your overall number of samples or use a missing value imputation strategy that is robust to outliers.
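The same fit-on-train, apply-to-both idea also covers the concern about the number of dummies. Below is a minimal sketch, assuming a scikit-learn version where `OneHotEncoder` accepts string categories and the `handle_unknown` parameter; the colour column is hypothetical toy data. Because the encoder learns its categories from the training set, the test set gets the same number of columns even when one level is absent from it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with three levels; the test split contains only two.
X_train = np.array([["red"], ["green"], ["blue"], ["green"]])
X_test = np.array([["red"], ["green"]])

# Fit on the training data only. handle_unknown="ignore" encodes a category
# that was never seen during fit as an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# transform returns a sparse matrix by default; convert to dense for display.
train_encoded = encoder.transform(X_train).toarray()
test_encoded = encoder.transform(X_test).toarray()

print(train_encoded.shape)  # (4, 3)
print(test_encoded.shape)   # (2, 3)
```

Fitting a separate encoder on the test set would instead produce two columns here, which no longer line up with what the model was trained on.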

Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer:
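A minimal sketch of such a pipeline, assuming a recent scikit-learn release where the imputer is `SimpleImputer` in `sklearn.impute`. The toy data, the median strategy, and the logistic regression classifier are illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 100 samples, 3 numeric features.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values.
X[rng.uniform(size=X.shape) < 0.1] = np.nan

# Split first, before anything is learned from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the imputer on the training data only and reuses the
# learned medians when transforming the test data.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", LogisticRegression()),
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Calling `pipeline.fit` only on the training split means the medians in `pipeline.named_steps["imputer"].statistics_` never depend on the test set.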

answered Oct 04 '22 13:10

Arnaud Joly

Tags: python, pandas, machine-learning, numpy, scikit-learn
