I am working on a project which performs text auto-classification, I have a lot of data set like as below:
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
then, I will use the above data set to generate a classifier, once new text coming, the classifier can label new text with correct CategoryName (text is natural language, size between 10-10000)
Now, the problem is, the original data set contains some incorrect data, (E.g. AAA should be labeled as Category AA, but it is labeled as Category BB accidentally ) because these data are classified manually. And I don't know which label is wrong and how many percentages are wrong because I can't review all data manually...
So my question is, what should I do?
Obviously, there is no easy way to solve your problem - after all, why build a classifier if you already have a system that can detect wrong classifications.
Do you know how much the erroneous classifications affect your learning? If there are only a small percentage of them, they should not hurt the performance much. (Edit. Ah, apparently you don't. Anyway, I suggest you try it out - at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With