What should I do when the training set contains some erroneous data in supervised classification?

Question

I am working on a project that performs automatic text classification. I have a large data set like the one below:

Text     | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA

I will then use this data set to train a classifier. When new text arrives, the classifier should label it with the correct CategoryName. (The text is natural language, with a size between 10 and 10000.)

Now, the problem is that the original data set contains some incorrect labels (e.g. a text AAA should be labeled as Category AA, but it was accidentally labeled as Category BB), because the data were classified manually. I don't know which labels are wrong or what percentage of them is wrong, because I can't review all the data manually...

So my question is, what should I do?

Can I find the wrong labels in some automatic way?
How can I increase precision and recall when new data arrives?
How can I evaluate the impact of the wrong data? (I don't know what percentage of the data is wrong.)
Any other suggestions?

Weetu · Accepted Answer

Obviously, there is no easy way to solve your problem. After all, why build a classifier if you already have a system that can detect wrong classifications?

Do you know how much the erroneous classifications affect your learning? If only a small percentage of the labels are wrong, they should not hurt performance much. (Edit: Ah, apparently you don't. Anyway, I suggest you try it out, at least if you can identify a false result when you see one.)
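
One possible way to "try it out" without reviewing everything is to hand-check a small random sample and estimate the error rate from it. The sketch below is only an illustration and not part of the original answer: `sample_for_review` and `wilson_interval` are hypothetical helper names, the data set size is invented, and the review counts are made-up numbers standing in for what a manual check might find.

```python
import math
import random


def sample_for_review(n_items, sample_size, seed=0):
    """Pick a reproducible random subset of row indices to check by hand."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), sample_size))


def wilson_interval(n_wrong, n_reviewed, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for the true error rate."""
    p = n_wrong / n_reviewed
    denom = 1 + z ** 2 / n_reviewed
    centre = (p + z ** 2 / (2 * n_reviewed)) / denom
    half = z * math.sqrt(
        p * (1 - p) / n_reviewed + z ** 2 / (4 * n_reviewed ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumption: a data set of 50,000 rows, of which 200 are reviewed by hand.
to_review = sample_for_review(n_items=50_000, sample_size=200)

# Assumption (made-up result): the reviewer found 12 wrong labels among the 200.
low, high = wilson_interval(n_wrong=12, n_reviewed=200)
print(f"estimated label error rate: between {low:.1%} and {high:.1%}")
```

The same two helpers could be pointed at the classifier's predictions on new text instead of the training labels, which gives a rough accuracy estimate for the trained system.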

Of course, you could always train your system first and then have it suggest classifications for the training data. Wherever its suggestion disagrees with the manual label, you have a candidate for review, which might help you identify (and correct) your faulty training data. How well this works depends on how much training data you have, and on whether it is broad enough for the system to learn the correct classification despite the faulty data.
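
The idea above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the answerer's implementation: a tiny hand-written Naive Bayes model stands in for whatever classifier you actually use, the toy texts and labels are invented, and the suggestions are made on held-out folds (each text is classified by a model that was not trained on it) so that the model cannot simply repeat a wrong label it has memorized. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized text."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the most probable label, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def flag_suspect_labels(docs, labels, n_folds=5):
    """Return (index, given_label, suggested_label) where the two disagree."""
    suspects = []
    for fold in range(n_folds):
        train_idx = [i for i in range(len(docs)) if i % n_folds != fold]
        test_idx = [i for i in range(len(docs)) if i % n_folds == fold]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            suggested = predict_nb(model, docs[i])
            if suggested != labels[i]:
                suspects.append((i, labels[i], suggested))
    return sorted(suspects)


# Invented toy data: item 6 is a travel text deliberately mislabeled as BB.
docs = [
    "cheap flights hotel booking", "stock market shares trading",
    "hotel booking beach holiday", "shares trading bank interest",
    "flights beach holiday hotel", "bank interest stock market",
    "holiday flights hotel beach", "market trading shares bank",
    "booking flights holiday cheap", "interest stock bank shares",
]
labels = ["AA", "BB", "AA", "BB", "AA", "BB", "BB", "BB", "AA", "BB"]

suspects = flag_suspect_labels(docs, labels, n_folds=5)
print(suspects)  # [(6, 'BB', 'AA')]
```

A disagreement is only a hint that a label deserves a second look. On real data the flagged items would still need a human check, since the classifier can be the one that is wrong.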

Tags:

machine-learning

classification

nlp

document-classification

Clover
