Should I keep/remove identical training examples that represent different objects?

I have prepared a dataset to recognise a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 negative and 30 positive, respectively.

Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?
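To make the situation concrete, here is a minimal sketch of how identical feature vectors can be counted without discarding them. The toy data (three features per object rather than ten) and all variable names are made up for illustration and are not from my actual dataset.

```python
from collections import Counter

# Hypothetical toy feature vectors: one tuple per object.
# Several different objects happen to map to the same feature values.
features = [
    (1, 0, 2),
    (1, 0, 2),
    (1, 0, 2),
    (0, 1, 1),
    (0, 1, 1),
    (2, 2, 0),
]

# How many objects share each distinct feature vector.
multiplicity = Counter(features)

n_objects = len(features)      # number of objects in the dataset
n_unique = len(multiplicity)   # number of distinct training instances

# Empirical distribution over distinct feature vectors. This is the
# information that would be lost if duplicates were simply dropped.
distribution = {vec: n / n_objects for vec, n in multiplicity.items()}

for vec, n in multiplicity.items():
    print(vec, n, round(distribution[vec], 3))
```

The counts in `multiplicity` are exactly the "duplication" I am asking about: they record how often each combination of feature values occurs among the objects.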

asked Oct 04 '14 22:10

Sultan Abraham

1 Answer

If you omit the duplicates, you skew the base rate of each distinct object. If the training data are a representative sample of the real world, you don't want that, because you would in effect be training for a slightly different world (one with different base rates).

To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 instances of object A and 1 instance of object B. After throwing out duplicates, you have 1 instance of A and 1 of B, so the base rates shift from 99% / 1% to 50% / 50%. A classifier trained on the de-duplicated data will be substantially different from one trained on the original data.
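The two-object scenario can be checked with a few lines of Python. This is only a sketch of the base-rate arithmetic; the helper `base_rates` is a name chosen for this illustration, and no actual classifier is trained here.

```python
from collections import Counter


def base_rates(samples):
    """Return the empirical frequency of each distinct item in samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


# Original data: 99 copies of object A and 1 copy of object B.
original = ["A"] * 99 + ["B"]

# De-duplicated data: one copy of each distinct object.
deduplicated = sorted(set(original))

original_rates = base_rates(original)
dedup_rates = base_rates(deduplicated)

print(original_rates)
print(dedup_rates)
```

Any learner that estimates priors from the training frequencies would see A as overwhelmingly common in the first case and as no more likely than B in the second.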

My advice is to leave the duplicates in the data.

answered Sep 17 '22 13:09

Robert Dodier

Tags:

machine-learning

statistics

classification

training-data
