I downloaded Skin Segmentation Data Set and found that it contains a lot of duplicates.
For example, this row 0 128 0 2
encountered 199 times.
Please, supply a few examples when duplicates is good and when is evil.
Yes of course, because if it is a random sample, that represents the underlying distribution in the data, that tells you that this particular value has a higher probability. Removing duplicates will just render the dataset pretty useless.
It is important.
For example: If row 'a' appears 5 times in your data and another row, 'b', appears only once, then you will want to classify row 'a' better than 'b' because when you will calculate the cost function, row 'a' will appear more time and have a bigger influence on the cost.
And, if your training represents well the test data, then there is a high probability that row 'a' will appear more times than row 'b' there.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With