
Refreshing training data for supervised learning - how to?

We have a classifier for web pages. The model was built with training data from about 2 years ago. We've noticed that its performance keeps deteriorating, and we assume it's due to the properties of web pages changing over time (mainly the words and terminology used, but also topology, HTML tags, etc.).

How would you approach this problem? Do we simply rebuild the entire training set and learn a new model? Is there a shortcut? Are there common practices or papers on how to do it? Note that we are pretty hooked on the supervised learning approach, where the system admins train a classifier, evaluate its performance on a test set, and then install the classifier in the "production" system.

Hope this isn't too vague...

ihadanny asked Sep 29 '22 20:09



1 Answer

Several factors come into play here, the main ones being the state of the classifier and the state of the data.

If changing web protocols do not force you to add any new inputs (features), you may be able to retrain your existing classifier on fresh data.

If the classifier was not designed to be retrained on new data, it may be difficult to salvage the old model. Likewise, if the inputs or outputs have changed, it may be easier to build a new classifier.
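To illustrate what "designed to be retrained on new data" can look like, here is a minimal sketch in plain Python. It is not your classifier: the `IncrementalNaiveBayes` class, its `update` method, and the toy pages and labels are all made up for this example. The idea is only that a model which stores running counts can fold in fresh labelled pages without being rebuilt from scratch.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Toy multinomial naive Bayes whose counts can be updated in place.

    Hypothetical illustration only; a real system would also need proper
    feature extraction, persistence, and some way to age out old data.
    """

    def __init__(self):
        self.class_counts = Counter()            # pages seen per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()                       # every word seen so far

    def update(self, pages, labels):
        """Fold a batch of tokenised pages into the existing counts."""
        for words, label in zip(pages, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, words):
        """Return the label with the highest log-probability."""
        total_pages = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_pages in self.class_counts.items():
            score = math.log(n_pages / total_pages)
            n_words = sum(self.word_counts[label].values())
            for w in words:
                # Laplace smoothing so unseen words do not zero out a class.
                count = self.word_counts[label][w] + 1
                score += math.log(count / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = IncrementalNaiveBayes()

# Stand-in for the 2-year-old training data.
model.update(
    [["buy", "cheap", "flash", "banner"], ["news", "article", "report"]],
    ["shop", "news"],
)
before = model.predict(["checkout", "cart"])

# Stand-in for freshly labelled pages using newer terminology.
model.update(
    [["checkout", "cart", "buy"], ["podcast", "news", "stream"]],
    ["shop", "news"],
)
after = model.predict(["checkout", "cart"])

print(before, after)
```

If your classifier has no equivalent of `update`, or if the feature set itself has to change, this shortcut does not apply and a full rebuild is the safer route.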

I don't know which classifier you are using or how you retrain it and process your data, so I can't give a direct answer to the issue you are facing or say whether there are any shortcuts. It really comes down to how accessible your classifier is and what it costs to maintain.

As you note in your question, the new classifier should be tested and compared against the current one to confirm that it meets your requirements before it is applied to the production environment.
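A rough sketch of that comparison step, assuming you have a small set of recently labelled pages to test on. The function names (`accuracy`, `should_promote`), the `min_gain` threshold, and the two stand-in classifiers are hypothetical; substitute whatever metric and models you actually use.

```python
def accuracy(predict, pages, labels):
    """Fraction of pages for which predict(page) matches the true label."""
    correct = sum(1 for page, label in zip(pages, labels) if predict(page) == label)
    return correct / len(labels)


def should_promote(old_predict, new_predict, pages, labels, min_gain=0.0):
    """Compare both models on the same recent test set.

    Returns (promote, old_accuracy, new_accuracy). The new model is promoted
    only if it beats the old one by at least min_gain.
    """
    old_acc = accuracy(old_predict, pages, labels)
    new_acc = accuracy(new_predict, pages, labels)
    return new_acc >= old_acc + min_gain, old_acc, new_acc


# Stand-in classifiers: the old one only knows outdated terminology.
def old_predict(words):
    return "shop" if "flash" in words else "news"


def new_predict(words):
    return "shop" if ("cart" in words or "flash" in words) else "news"


# Recently labelled pages (toy data), used only for evaluation.
recent_pages = [
    ["cart", "checkout"],
    ["news", "report"],
    ["cart", "sale"],
    ["flash", "banner"],
]
recent_labels = ["shop", "news", "shop", "shop"]

promote, old_acc, new_acc = should_promote(
    old_predict, new_predict, recent_pages, recent_labels, min_gain=0.05
)
print(promote, old_acc, new_acc)
```

The important part is that both models are scored on the same fresh test set; scoring the new model on new data and the old model on its original test set would hide exactly the drift you are trying to measure.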

Matthew Spencer answered Oct 05 '22 08:10
