I work with a fair number of data sets that have many records -- often in the millions of records. It seems to me that not all of these records are equally useful for building an effective model of the data, e.g., because there are duplicates in the data set. These data sets could be much easier and faster to analyze if they were reduced to a better set of records.
What preprocessing methods are there for reducing data set size (e.g., removing records) without losing information for machine learning problems?
I know one simple transformation is to summarize duplicate records and weight them accordingly, but is there anything more advanced than that?
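For concreteness, the kind of transformation I mean looks roughly like this minimal pandas sketch (the column names are just placeholders, and most scikit-learn estimators can consume the resulting weight column via the sample_weight argument of fit()):

```python
import pandas as pd

# Toy frame standing in for a much larger dataset; "x1", "x2" and
# "label" are placeholder column names.
df = pd.DataFrame({
    "x1":    [0, 0, 1, 1, 1],
    "x2":    [5, 5, 2, 2, 3],
    "label": [0, 0, 1, 1, 0],
})

# Collapse identical rows into a single record and keep the multiplicity
# as a "weight" column.
reduced = (
    df.groupby(list(df.columns), as_index=False)
      .size()
      .rename(columns={"size": "weight"})
)
print(reduced)
```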
Data preprocessing is the process of transforming raw data into a usable format. It is an important step in data mining because algorithms cannot work with raw data directly, and data quality should be checked before applying machine learning or data mining algorithms.
That's a very interesting problem indeed. Firstly, defining an information measure for your datasets is already a challenge. Once you have that, you should be able to measure the difference between original and reduced datasets.
As you mentioned, removing duplicate records is an option, but it won't help if there aren't many of them. Depending on the distribution of your records, you might simply select a subset at random, or follow a stratified approach (see e.g. density preserving sampling).
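For plain stratified subsampling, something along these lines would do; this is only a sketch on a synthetic stand-in dataset, the 10% fraction is arbitrary, and density preserving sampling itself would need a dedicated implementation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large labelled dataset.
X, y = make_classification(n_samples=100_000, n_classes=3,
                           n_informative=5, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# Keep a 10% subsample whose class proportions match the full data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0
)
print(X_small.shape, y_small.shape)
```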
Another approach that drastically reduces the number of records is prototype selection, in which representative records are chosen using nearest neighbours (see http://sci2s.ugr.es/pr for academic papers).
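One off-the-shelf nearest-neighbour prototype selector is Condensed Nearest Neighbour; a rough sketch using the imbalanced-learn package could look like the following (the sample size and the sampling_strategy="all" choice are illustrative assumptions, and the linked papers cover many more variants):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Synthetic stand-in; condensing is expensive, so keep the demo small.
X, y = make_classification(n_samples=5_000, random_state=0)

# Keep only the records a 1-NN classifier needs in order to still label
# the discarded ones correctly; "all" condenses every class.
cnn = CondensedNearestNeighbour(sampling_strategy="all", random_state=0)
X_proto, y_proto = cnn.fit_resample(X, y)
print(len(X), "->", len(X_proto), "records kept")
```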
Let's assume you are doing K nearest neighbor classification. Cluster the training data into sufficiently many clusters to ensure that each cluster is homogeneous, i.e., all its exemplars are from the same class. Then, for each cluster, select one typical exemplar and discard the rest.
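A rough sketch of that idea, assuming a KMeans over-clustering with scikit-learn (the cluster count of 200 is an arbitrary placeholder you would increase until most clusters come out pure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin_min

# Synthetic stand-in for a labelled training set.
X, y = make_classification(n_samples=10_000, random_state=0)

# Over-cluster so that most clusters come out class-pure.
km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X)

keep = []
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    if len(np.unique(y[idx])) == 1:
        # Homogeneous cluster: keep the exemplar nearest the centroid.
        nearest, _ = pairwise_distances_argmin_min(
            km.cluster_centers_[c].reshape(1, -1), X[idx])
        keep.append(idx[nearest[0]])
    else:
        # Mixed cluster: keep everything (or re-cluster more finely).
        keep.extend(idx)

X_reduced, y_reduced = X[np.asarray(keep)], y[np.asarray(keep)]
print(len(X), "->", len(X_reduced), "records kept")
```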
Of course, your intuition that large amounts of data are not valuable may be incorrect: "it's not who has the best algorithm who wins, it's who has the most data."
And if you add a weight/count for how many times a feature set occurs, you have increased your memory requirement by m * 32 bits (or whatever your counter size is), so you might not come out ahead unless you have a lot of duplicates or a large feature set.
The suggestion to use PCA makes sense because, by reducing the size of each record, you again save m * (however much you've saved per record).
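For example, with scikit-learn you might keep just enough principal components to retain about 95% of the variance; the threshold and the synthetic data here are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 features per record.
X, _ = make_classification(n_samples=50_000, n_features=200,
                           n_informative=20, random_state=0)

# Keep just enough components to explain ~95% of the variance, so each
# record shrinks from 200 values to however many components that takes.
pca = PCA(n_components=0.95)
X_small = pca.fit_transform(X)
print(X.shape, "->", X_small.shape)
```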
I also think the suggestion to use k-means is a good one, although I would probably use the centroid of the cluster as my exemplar (rather than a representative data point). If you go this route, I think you would definitely want to include a count/weight of how much data there is in that cluster. After all, the fact that data is duplicated is probably highly relevant in many models!
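A sketch of that centroid-plus-weight idea, assuming scikit-learn's KMeans (k = 1000 is an arbitrary placeholder; for classification you would cluster per class or check cluster purity as suggested above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Synthetic stand-in for a large dataset.
X, _ = make_classification(n_samples=100_000, random_state=0)

# Replace the data with k centroids, weighting each one by its cluster
# size so downstream models (via sample_weight) still "see" duplication.
k = 1_000
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
X_reduced = km.cluster_centers_
weights = np.bincount(km.labels_, minlength=k)
print(X.shape, "->", X_reduced.shape)
```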
Sometimes the simplest methods are best... Random sampling is easy to understand, hard to screw up, and unlikely to introduce bias into your process. Building a training pipeline using a random sample (without replacement) of your dataset is a good way to work faster. Once you have a pipeline you're satisfied with, you can then run it again over your entire dataset to estimate the gain in performance from using the entire dataset.
If your training pipeline is robust, your results should not change too much, and although your performance might rise, it will tend to do so very slowly as you add more data. The basic intuition here is that the strongest signals in your data will show up even with relatively small samples of the data, almost by definition (if they didn't, they wouldn't be strong!). Using more and more data does allow you to capture more and more subtle patterns but you face diminishing returns.
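As a sketch of that workflow (the model, sample fraction, and synthetic data are placeholders): develop on a small random sample drawn without replacement, then rerun the identical pipeline on the full data to measure the gain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a large labelled dataset.
X, y = make_classification(n_samples=100_000, random_state=0)

# Develop on a 5% random sample drawn without replacement...
X_dev, _, y_dev, _ = train_test_split(X, y, train_size=0.05,
                                      random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
print("sample:", cross_val_score(model, X_dev, y_dev, cv=3).mean())

# ...then rerun the identical pipeline on everything to see what the
# extra data actually buys you.
print("full  :", cross_val_score(model, X, y, cv=3).mean())
```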
I should add that training certain kinds of models on millions of examples should be fairly fast on easily-available hardware.
Graphs showing the tradeoffs of both training speed and accuracy vs number of examples can be found here: https://github.com/szilard/benchm-ml