
Are duplicates useful in data sets?

I downloaded the Skin Segmentation Data Set and found that it contains a lot of duplicates.
For example, the row 0 128 0 2 occurs 199 times.

Please give a few examples of when duplicates are good and when they are bad.

MrPisarik asked Oct 18 '22 17:10


2 Answers

Yes, of course. If the data set is a random sample that represents the underlying distribution, then duplicates tell you that this particular value has a higher probability. Removing the duplicates would discard that information and render the data set pretty useless.
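To make this concrete, here is a minimal sketch of how duplicate counts act as empirical probabilities. The rows are hypothetical toy values that only mimic the four-column shape of the row quoted in the question; they are not taken from the actual data set.

```python
from collections import Counter

# Hypothetical toy rows (assumption: four integer columns, as in the
# row "0 128 0 2" quoted in the question). Not real data.
rows = [
    (0, 128, 0, 2),
    (0, 128, 0, 2),
    (0, 128, 0, 2),
    (74, 85, 123, 1),
]

counts = Counter(rows)
total = len(rows)

# Empirical probability of each distinct row, duplicates included.
empirical = {row: n / total for row, n in counts.items()}

# After dropping duplicates, every distinct row looks equally likely.
unique_rows = set(rows)
deduplicated = {row: 1 / len(unique_rows) for row in unique_rows}

print(empirical[(0, 128, 0, 2)])     # 0.75
print(deduplicated[(0, 128, 0, 2)])  # 0.5
```

In this toy sample the repeated row accounts for 75% of the observations, but after deduplication it looks no more likely than the row seen once.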

latorrefabian answered Oct 29 '22 22:10


Duplicates are important.

For example, if row 'a' appears 5 times in your data and another row, 'b', appears only once, then you want to classify row 'a' better than 'b': when you calculate the cost function, row 'a' appears more times and therefore has a bigger influence on the cost.

And if your training data represents the test data well, there is a high probability that row 'a' will also appear more often than row 'b' there.
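As a rough illustration of the cost argument, the sketch below uses a summed squared error and a one-parameter model. Both are assumptions chosen for simplicity, as are the values of rows 'a' and 'b'. It shows that keeping duplicates is equivalent to weighting each unique row by its count.

```python
from collections import Counter


def squared_error(prediction, target):
    return (prediction - target) ** 2


def predict(x, w=0.5):
    # Hypothetical one-parameter model, used only for illustration.
    return w * x


# Hypothetical (feature, target) pairs: row 'a' appears 5 times, 'b' once.
a = (1.0, 2.0)
b = (3.0, 1.0)
data = [a] * 5 + [b]

# Cost summed over every row, duplicates included.
cost_with_duplicates = sum(squared_error(predict(x), y) for x, y in data)

# The same cost, computed over unique rows weighted by their counts.
counts = Counter(data)
cost_weighted = sum(
    n * squared_error(predict(x), y) for (x, y), n in counts.items()
)

# Cost after dropping duplicates: 'a' and 'b' now count equally.
cost_deduplicated = sum(squared_error(predict(x), y) for x, y in counts)

print(cost_with_duplicates, cost_weighted, cost_deduplicated)
```

With these numbers the error on 'a' contributes 11.25 of the 11.5 total cost, so an optimizer minimizing it would mostly try to fit 'a'. After deduplication the cost is 2.5 and 'a' gets the same single vote as 'b'.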

manbearpig answered Oct 29 '22 21:10