Good data set for Pre-processing

I am enrolled in an undergraduate course in Data Mining and have an assignment to code a data mining pre-processor. I am free to choose the programming language and the data set, and I was wondering if anybody could suggest a good data set to use. I have been going through the UCI Repository and have found several similar resources, but being a beginner I am not sure which data set would be a good choice. The pre-processor should handle the following:

  • Data cleaning
    • Missing Values
    • Errors
    • Outliers
    • Normalization
    • De-duplication
  • Data Reduction
    • Sampling Techniques
    • Dimensionality Reduction

What kind of properties should I consider when choosing the data set? Any specific data set you would suggest?

pcx asked Apr 15 '26 20:04



1 Answer

You have answered your own question: pick data sets that have the properties you listed. The UCI repository categorizes its data sets, so you can choose any of them to start playing with.

To start with, if I were you, I would proceed step by step: get a feel for what each of these issues looks like and how it affects classifier performance. Choose some of the popular data sets, since they are used as benchmarks in most research papers. Many of the items you listed are separate machine learning problems in their own right, each with a lot of ongoing research.

I would start with something like this:

  • Missing values: Iris, Voting, Heart Disease
  • Duplicates: the 921,810 song data set (not from UCI, I think)
  • Normalization: any continuous-valued data set whose features have different ranges
  • Sampling techniques: Pima
  • Dimensionality reduction: Swiss Roll
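Before committing to a data set, it may help to see how small the individual cleaning steps can be. The following is only a rough sketch in plain Python on a made-up table of (age, income) rows. The data, the function names, the choice of median imputation and the 1.5 × IQR outlier rule are all my own assumptions for illustration, not requirements of your assignment.

```python
import statistics

# Hypothetical toy rows: (age, income in thousands); None marks a missing value.
rows = [
    (25, 30.0),
    (32, None),
    (25, 30.0),    # exact duplicate of the first row
    (41, 52.0),
    (None, 48.0),
    (38, 45.0),
    (29, 41.0),
    (35, 900.0),   # income is suspiciously large
]

def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_median(rows):
    """Replace None in each column with the median of that column's known values."""
    columns = list(zip(*rows))
    medians = [
        statistics.median([v for v in col if v is not None]) for col in columns
    ]
    return [
        tuple(m if v is None else v for v, m in zip(row, medians)) for row in rows
    ]

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < low or v > high]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

clean = impute_median(deduplicate(rows))
incomes = [income for _, income in clean]
print("suspect rows:", iqr_outliers(incomes))
print("scaled ages:", min_max([age for age, _ in clean]))
```

A data set is a good fit for the assignment if each of these functions actually has something to do on it: some missing cells, some repeated rows, at least one feature with extreme values, and features on clearly different scales.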

Another good way to find data sets is to look at the relevant publications. For dimensionality reduction, for example, you can look at the PCA and ISOMAP papers, and for sampling, the SMOTE paper. See what kind of data they use in their experiments and proceed accordingly.
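To give an idea of what a sampling paper such as SMOTE is doing, here is a simplified sketch of its core idea: create new minority-class points by interpolating between an existing point and one of its nearest minority neighbours. This is not the reference implementation. The function name `smote_like`, its parameters and the toy points are made up for illustration, and it assumes at least two minority points.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority points.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
for point in smote_like(minority, n_new=5):
    print(point)
```

Every synthetic point lies on a line segment between two real minority points, which is why an imbalanced two-class data set such as Pima is a natural test bed for this kind of technique.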

iinception answered Apr 17 '26 12:04
