
Should I keep/remove identical training examples that represent different objects?

I have prepared a dataset for recognising a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.

Since the identical training instances actually represent different objects, can I say that this duplication carries relevant information (e.g. the distribution of object feature values) that may be useful to the classifier?

Sultan Abraham asked Oct 04 '14 22:10





1 Answer

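If you omit the duplicates, you will skew the base rate of each distinct object, that is, how often each one occurs in the data. With your numbers, the negative-to-positive ratio would go from roughly 25:1 (2240 to 90) to roughly 4:1 (130 to 30). If the training data are a representative sample of the real world, then you don't want that, because you would actually be training for a slightly different world (one with different base rates).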

To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 instances of object A and 1 instance of object B. After throwing out duplicates, you have 1 instance of A and 1 of B. A classifier trained on the de-duplicated data will be substantially different from one trained on the original data.
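As a rough illustration of the two-object scenario, here is a minimal Python sketch that compares the base rates before and after de-duplication. The feature values and the `base_rates` helper are made up for the example, and it only computes label frequencies rather than training a classifier.

```python
from collections import Counter


def base_rates(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical data mirroring the scenario: 99 identical instances of A, 1 of B.
# Each example is a (feature_vector, label) pair; the feature values are invented.
original = [((0.0,), "A")] * 99 + [((1.0,), "B")]

# De-duplicate on the full (features, label) pair, keeping first-seen order.
deduplicated = list(dict.fromkeys(original))

original_rates = base_rates(label for _, label in original)
dedup_rates = base_rates(label for _, label in deduplicated)

print("original:     ", original_rates)
print("de-duplicated:", dedup_rates)
```

On the original data A makes up 99% of the examples, while after de-duplication A and B each make up 50%, so any model that relies on class frequencies would see a very different prior.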

My advice is to leave the duplicates in the data.

Robert Dodier answered Sep 17 '22 13:09
