I'm not sure this is the right Stack Exchange site for machine learning questions, but I've seen ML questions here before, so I'm trying my luck (also posted at http://math.stackexchange.com).
I have training instances that come from different sources, so building a single model doesn't work well. Is there a known method for handling such cases?
An example explains this best. Say I want to classify cancer/non-cancer given training data that was collected from different populations. Training instances from one population might have a completely different distribution of positive/negative examples than those from other populations. I could build a separate model for each population, but the problem is that at test time I don't know which population a test instance comes from.
*All training/testing instances have the exact same feature set, regardless of the population they come from.
I suspect that this might not work any better than just throwing all your data into a single classifier trained on the whole set. At a high level, the features should determine the label, not which input distribution the point came from. But you could try it.
Train a separate classifier for each dataset that tries to predict the label. Then train a classifier on the combined data that tries to predict which dataset a point came from. When you want to predict the label of a test instance, run each sub-classifier and weight its prediction by the probability the high-level dataset classifier assigns to that source.
This feels a lot like the estimation step in a Gaussian mixture model, where the probability of a data point is a weighted average of the probabilities assigned by the K components.
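A rough sketch of that weighting scheme, assuming scikit-learn and hypothetical per-population arrays `X_list` / `y_list` (both names are mine, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_mixture(X_list, y_list):
    """X_list[k], y_list[k]: features and labels for population k (hypothetical inputs)."""
    # One "expert" classifier per population, predicting the label.
    experts = [LogisticRegression(max_iter=1000).fit(X, y)
               for X, y in zip(X_list, y_list)]
    # A "gating" classifier on the pooled data, predicting which population a point came from.
    X_all = np.vstack(X_list)
    src = np.concatenate([np.full(len(X), k) for k, X in enumerate(X_list)])
    gate = LogisticRegression(max_iter=1000).fit(X_all, src)
    return experts, gate

def predict_positive_proba(experts, gate, X_test):
    # Weight each expert's P(label = 1 | x) by the gate's P(source = k | x).
    gate_w = gate.predict_proba(X_test)                                            # shape (n, K)
    expert_p = np.column_stack([e.predict_proba(X_test)[:, 1] for e in experts])   # shape (n, K)
    return (gate_w * expert_p).sum(axis=1)
```

If the gate is confident about a test point's source, this effectively falls back to that population's model; if it is uncertain, you get a blend of the experts.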
A classical approach to this is hierarchical modeling (if your sources admit a hierarchy), fixed-effects models (or random-effects models, depending on the assumptions and circumstances), or various other grouped or structural models.
You can do the same in a machine learning context by describing the distributions as a function of the source, both in terms of the sample populations and the response variable(s). Thus, source is essentially a feature that could potentially interact with all (or most) of the other features.
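In code terms, one minimal way to do this (my own sketch, not part of the original answer) is to append a one-hot source indicator to the feature matrix and let a flexible model learn source-specific effects. The toy data below is made up just to keep the snippet runnable:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Made-up stand-in data: two populations whose feature and label distributions differ.
X = rng.normal(size=(200, 5))
source = rng.integers(0, 2, size=200)                 # which population each row came from
y = (X[:, 0] + 0.5 * source + rng.normal(scale=0.5, size=200) > 0).astype(int)

# One-hot encode the source and append it as extra columns, so a flexible
# model can learn interactions between the source and the other features.
S = np.eye(source.max() + 1)[source]
X_aug = np.hstack([X, S])

clf = GradientBoostingClassifier().fit(X_aug, y)
```

Of course, using source as a feature at prediction time only works if you know (or can estimate) which population the test instance came from, which ties into the point below.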
The bigger question is whether your future (test) data will come from one of these sampling populations or yet another population.
Update 1: If you want to focus on machine learning, rather than statistics, another related concept to look into is transfer learning. It's not terribly complicated, though it is rather hyped. The basic idea is that you find common properties in the auxiliary data distributions that can be mapped into the predictor/response framework of the target data source. In another sense, you're looking for a way to exclude source-dependent variation. These are very high level descriptions, but should help in your reading plans.
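To make that concrete, here is one simple instantiation of the transfer-learning idea, importance weighting under covariate shift (my own sketch, not necessarily what the post has in mind): train a classifier to distinguish auxiliary from target data, then up-weight auxiliary examples that look like the target population. All names and toy data here are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up data: labeled "auxiliary" examples from one population, plus a small
# batch of unlabeled "target" examples drawn from a shifted distribution.
X_aux = rng.normal(loc=0.0, size=(300, 4))
y_aux = (X_aux[:, 0] > 0).astype(int)
X_tgt = rng.normal(loc=0.5, size=(100, 4))    # unlabeled, shifted population

# 1. Domain classifier: auxiliary (0) vs. target (1).
X_dom = np.vstack([X_aux, X_tgt])
d = np.concatenate([np.zeros(len(X_aux)), np.ones(len(X_tgt))])
dom = LogisticRegression(max_iter=1000).fit(X_dom, d)

# 2. Importance weights for auxiliary points: p(target | x) / p(aux | x),
#    so auxiliary examples that resemble the target data count more.
p = dom.predict_proba(X_aux)
w = p[:, 1] / np.clip(p[:, 0], 1e-6, None)

# 3. Fit the label classifier on the auxiliary data with those weights.
clf = LogisticRegression(max_iter=1000).fit(X_aux, y_aux, sample_weight=w)
```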