Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, so that all exploration is based solely on training data, or to explore the full data set first and split afterwards?

I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.

I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?

I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.

I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).

What do you do in your own work and why?

Thanks for helping a new programmer!

asked Jan 21 '19 01:01

Amy Gill

1 Answer

To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).

Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.

Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.

Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.
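As a rough illustration of the three-way split described above, here is a minimal sketch in Python using only the standard library. The function name `train_val_test_split`, the 60/20/20 proportions, the seed, and the stand-in `ratings` list are all assumptions made for this example, not something from the question; the same idea carries over to R.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle the rows once and cut them into three disjoint parts.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical stand-in for the real observations (e.g. rating records).
ratings = list(range(100))
train, val, test = train_val_test_split(ratings)
print(len(train), len(val), len(test))
```

Fixing the seed means the same rows land in the test set every time the script is rerun, so the held-out data stays held out across sessions.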

This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
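One way to put this into practice is to split first and then derive every data-dependent quantity from the training portion only. The sketch below, in standard-library Python, uses a made-up numeric feature and standardization as a stand-in for any decision informed by exploration; the variable names, the 80/20 split, and the synthetic data are assumptions for illustration only.

```python
import random
import statistics

# Synthetic stand-in for one numeric feature; the real data would be loaded here.
rng = random.Random(0)
values = [rng.gauss(50, 10) for _ in range(200)]

# Split BEFORE looking at any distributions (80/20 is an arbitrary choice).
rng.shuffle(values)
cut = int(len(values) * 0.8)
train_values, test_values = values[:cut], values[cut:]

# Exploration and any statistics derived from it use the training part only.
train_mean = statistics.mean(train_values)
train_sd = statistics.stdev(train_values)


def standardize(xs, mean, sd):
    """Center and scale xs using statistics supplied by the caller."""
    return [(x - mean) / sd for x in xs]


# The test data is transformed with the training statistics, never its own.
train_scaled = standardize(train_values, train_mean, train_sd)
test_scaled = standardize(test_values, train_mean, train_sd)
```

Because the test values never contribute to `train_mean` or `train_sd`, nothing learned from them can leak into feature engineering or model choice.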

answered Nov 23 '22 16:11

Fabian

Tags: r, machine-learning, data-analysis
