Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?

I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.

I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?

I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.

I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).

What do you do in your own work and why?

Thanks for helping a new programmer!

like image 592
Amy Gill Avatar asked Jan 21 '19 01:01

Amy Gill


People also ask

Should you split data before EDA?

You should do splitting after EDA. EDA includes pre-processing of data like missing values,outliers,scaling changing variables,etc after that only do spiltting. as soon as when new test data arrive it should mandatory pass through all the preprocess data preparation step as you have done for data before splitting.

How should you split up a dataset into test and training sets?

Split the data set into two pieces — a training set and a testing set. This consists of random sampling without replacement about 75 percent of the rows (you can vary this) and putting them into your training set. The remaining 25 percent is put into your test set.

Why do you need to split the data into training and test data prior to beginning model training?

Separating data into training and testing sets is an important part of evaluating data mining models. Typically, when you separate a data set into a training set and testing set, most of the data is used for training, and a smaller portion of the data is used for testing.

Do we need to do EDA on test data?

Why EDA is important? The main purpose of EDA is to detect any errors, outliers as well as to understand different patterns in the data. It allows Analysts to understand the data better before making any assumptions.


1 Answers

To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).

Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.

Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.

Having trained our final model, we often want to have an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance unsing a new batch of data, the testing data.

This discussion gives the answer your question: We should not use the testing (or validation) data set for exploratory data analysis. Because if we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability of getting an unbiased estimate of our model's performance.

like image 91
Fabian Avatar answered Nov 23 '22 16:11

Fabian