I have prepared a dataset to recognise a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.
Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?
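One way to make the question concrete is to collapse the feature vectors into distinct rows while keeping a count of how often each one occurs, so the distribution information is not lost. The following is only a minimal sketch: the helper name `summarize_duplicates` and the toy rows are hypothetical and not taken from the dataset described above.

```python
from collections import Counter

def summarize_duplicates(rows):
    """Return the distinct feature vectors and how many times each occurs."""
    # Counter keeps first-seen order (Python 3.7+), so the output is stable.
    counts = Counter(tuple(r) for r in rows)
    unique_rows = list(counts)
    multiplicity = [counts[r] for r in unique_rows]
    return unique_rows, multiplicity

# Hypothetical toy data: six objects, two features each.
rows = [(0, 1), (0, 1), (0, 1), (2, 3), (2, 3), (5, 5)]
unique_rows, multiplicity = summarize_duplicates(rows)

print(len(rows), "objects,", len(unique_rows), "distinct feature vectors")
print(dict(zip(unique_rows, multiplicity)))
```

The counts could, for instance, be passed as per-sample weights to a learner that supports them, which would be equivalent in spirit to keeping the duplicates.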
This, in turn, leads to organisations holding more than one record of someone – possibly with conflicting information. Identifying and removing or merging these duplicate records from your database is a key part of forming an effective Single Customer View (SCV).
Duplicate entries can ruin the split between train, validation, and test sets when identical entries do not all land in the same set. This can lead to biased performance estimates and, as a result, a model that disappoints in production.
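The leakage problem described above can be avoided by splitting on distinct feature vectors rather than on individual rows, so that all copies of a vector end up on the same side. This is a rough sketch under stated assumptions: `grouped_split`, its parameters, and the toy rows are hypothetical, and the test fraction here applies to distinct vectors, not to rows.

```python
import random
from collections import defaultdict

def grouped_split(rows, test_fraction=0.25, seed=0):
    """Split row indices so identical feature vectors never straddle the split."""
    # Group row indices by their (hashable) feature vector.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row)].append(i)

    # Shuffle the distinct vectors, then assign whole groups to one side.
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))

    test_idx = sorted(i for k in keys[:n_test] for i in groups[k])
    train_idx = sorted(i for k in keys[n_test:] for i in groups[k])
    return train_idx, test_idx

# Hypothetical toy data with repeated feature vectors.
rows = [(0, 1), (0, 1), (2, 3), (2, 3), (5, 5), (7, 8), (7, 8), (7, 8)]
train_idx, test_idx = grouped_split(rows)

train_vectors = {rows[i] for i in train_idx}
test_vectors = {rows[i] for i in test_idx}
print("shared vectors:", train_vectors & test_vectors)
```

If scikit-learn is available, `GroupShuffleSplit` in `sklearn.model_selection` offers the same idea, with the feature-vector identity used as the group label.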
Ideally, the more training examples you have, the lower your test error will be (the model's variance decreases, meaning it overfits less). In theory, however, more data does not always give you a more accurate model, since high-bias models do not benefit from additional training examples.
If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).
To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different from one trained on the original data.
My advice is to leave the duplicates in the data.
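The two-object scenario above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper `base_rates` is a hypothetical name, and the labels "A" and "B" stand in for the two distinct objects.

```python
from collections import Counter

def base_rates(samples):
    """Empirical frequency of each distinct object in the sample."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()}

original = ["A"] * 99 + ["B"]                 # 99 of object A, 1 of object B
deduplicated = list(dict.fromkeys(original))  # one copy of each, order kept

print(base_rates(original))
print(base_rates(deduplicated))
```

In the original data A has a base rate of 99%, while after de-duplication both objects sit at 50%, which is the "slightly different world" the classifier would be trained for.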