I work with a fair number of data sets that have many records -- often in the millions of records. It seems to me that not all of these records are equally useful for building an effective model of the data, e.g., because there are duplicates in the data set. These data sets could be much easier and faster to analyze if they were reduced to a better set of records.
What preprocessing methods are there for reducing data set size (e.g., removing records) without losing information for machine learning problems?
I know one simple transformation is to summarize duplicate records and weight them accordingly, but is there anything more advanced than that?
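For concreteness, the kind of transformation I mean looks roughly like this minimal pandas sketch (the column names are just placeholders, and most scikit-learn estimators can consume the resulting weight column via the sample_weight argument of fit()):

```python
import pandas as pd

# Toy frame standing in for a much larger dataset; "x1", "x2" and
# "label" are placeholder column names.
df = pd.DataFrame({
    "x1":    [0, 0, 1, 1, 1],
    "x2":    [5, 5, 2, 2, 3],
    "label": [0, 0, 1, 1, 0],
})

# Collapse identical rows into a single record and keep the multiplicity
# as a "weight" column.
reduced = (
    df.groupby(list(df.columns), as_index=False)
      .size()
      .rename(columns={"size": "weight"})
)
print(reduced)
```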
Data preprocessing is the process of transforming raw data into a usable format. It is an important step in data mining because algorithms cannot work with raw data directly, and data quality should be checked before applying machine learning or data mining algorithms.
That's a very interesting problem indeed. Firstly, defining an information measure for your datasets is already a challenge. Once you have that, you should be able to measure the difference between original and reduced datasets.
As you mentioned, removing duplicate records is an option, but it won't help if there aren't many of them. Depending on the distribution of your records, you might simply select a subset at random, or follow a stratified approach (see e.g. density preserving sampling).
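For plain stratified subsampling, something along these lines would do; this is only a sketch on a synthetic stand-in dataset, the 10% fraction is arbitrary, and density preserving sampling itself would need a dedicated implementation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large labelled dataset.
X, y = make_classification(n_samples=100_000, n_classes=3,
                           n_informative=5, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# Keep a 10% subsample whose class proportions match the full data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0
)
print(X_small.shape, y_small.shape)
```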
Another approach that drastically reduces the number of records is prototype selection, in which representative records are chosen using nearest neighbours (see http://sci2s.ugr.es/pr for academic papers).
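One off-the-shelf nearest-neighbour prototype selector is Condensed Nearest Neighbour; a rough sketch using the imbalanced-learn package could look like the following (the sample size and the sampling_strategy="all" choice are illustrative assumptions, and the linked papers cover many more variants):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Synthetic stand-in; condensing is expensive, so keep the demo small.
X, y = make_classification(n_samples=5_000, random_state=0)

# Keep only the records a 1-NN classifier needs in order to still label
# the discarded ones correctly; "all" condenses every class.
cnn = CondensedNearestNeighbour(sampling_strategy="all", random_state=0)
X_proto, y_proto = cnn.fit_resample(X, y)
print(len(X), "->", len(X_proto), "records kept")
```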
Let's assume you are doing K nearest neighbor classification. Cluster the training data into sufficiently many clusters to ensure that each cluster is homogeneous, i.e., all its exemplars are from the same class. Then, for each cluster, select one typical exemplar and discard the rest.
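A rough sketch of that idea, assuming a KMeans over-clustering with scikit-learn (the cluster count of 200 is an arbitrary placeholder you would increase until most clusters come out pure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin_min

# Synthetic stand-in for a labelled training set.
X, y = make_classification(n_samples=10_000, random_state=0)

# Over-cluster so that most clusters come out class-pure.
km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X)

keep = []
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    if len(np.unique(y[idx])) == 1:
        # Homogeneous cluster: keep the exemplar nearest the centroid.
        nearest, _ = pairwise_distances_argmin_min(
            km.cluster_centers_[c].reshape(1, -1), X[idx])
        keep.append(idx[nearest[0]])
    else:
        # Mixed cluster: keep everything (or re-cluster more finely).
        keep.extend(idx)

X_reduced, y_reduced = X[np.asarray(keep)], y[np.asarray(keep)]
print(len(X), "->", len(X_reduced), "records kept")
```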
Of course, your intuition that large amounts of data are not valuable may be incorrect: "it's not who has the best algorithm who wins, it's who has the most data."
And if you add a weight/count for how many times a feature set occurs, you have increased your memory requirement by m * 32 bits (or whatever your counter size is), so you might not come out ahead unless you have a lot of duplicates or a large feature set.
The suggestion to use PCA makes sense because, by reducing the size of each record, you again save m * (however much you've saved per record).
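For example, with scikit-learn you might keep just enough principal components to retain about 95% of the variance; the threshold and the synthetic data here are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 features per record.
X, _ = make_classification(n_samples=50_000, n_features=200,
                           n_informative=20, random_state=0)

# Keep just enough components to explain ~95% of the variance, so each
# record shrinks from 200 values to however many components that takes.
pca = PCA(n_components=0.95)
X_small = pca.fit_transform(X)
print(X.shape, "->", X_small.shape)
```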
I also think the suggestion to use k-means is a good one, although I would probably use the centroid of the cluster as my exemplar (rather than a representative data point). If you go this route, I think you would definitely want to include a count/weight of how much data there is in that cluster. After all, the fact that data is duplicated is probably highly relevant in many models!
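A sketch of that centroid-plus-weight idea, assuming scikit-learn's KMeans (k = 1000 is an arbitrary placeholder; for classification you would cluster per class or check cluster purity as suggested above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Synthetic stand-in for a large dataset.
X, _ = make_classification(n_samples=100_000, random_state=0)

# Replace the data with k centroids, weighting each one by its cluster
# size so downstream models (via sample_weight) still "see" duplication.
k = 1_000
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
X_reduced = km.cluster_centers_
weights = np.bincount(km.labels_, minlength=k)
print(X.shape, "->", X_reduced.shape)
```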
Sometimes the simplest methods are best... Random sampling is easy to understand, hard to screw up, and unlikely to introduce bias into your process. Building a training pipeline using a random sample (without replacement) of your dataset is a good way to work faster. Once you have a pipeline you're satisfied with, you can then run it again over your entire dataset to estimate the gain in performance from using the entire dataset.
If your training pipeline is robust, your results should not change too much, and although your performance might rise, it will tend to do so very slowly as you add more data. The basic intuition here is that the strongest signals in your data will show up even with relatively small samples of the data, almost by definition (if they didn't, they wouldn't be strong!). Using more and more data does allow you to capture more and more subtle patterns but you face diminishing returns.
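As a sketch of that workflow (the model, sample fraction, and synthetic data are placeholders): develop on a small random sample drawn without replacement, then rerun the identical pipeline on the full data to measure the gain.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a large labelled dataset.
X, y = make_classification(n_samples=100_000, random_state=0)

# Develop on a 5% random sample drawn without replacement...
X_dev, _, y_dev, _ = train_test_split(X, y, train_size=0.05,
                                      random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
print("sample:", cross_val_score(model, X_dev, y_dev, cv=3).mean())

# ...then rerun the identical pipeline on everything to see what the
# extra data actually buys you.
print("full  :", cross_val_score(model, X, y, cv=3).mean())
```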
I should add that training certain kinds of models on millions of examples should be fairly fast on easily-available hardware.
Graphs showing the tradeoffs of both training speed and accuracy vs number of examples can be found here: https://github.com/szilard/benchm-ml