Machine learning philosophy: applying model to biased data

I have a machine learning problem and I don't know if there is a theoretical solution to it.

I have labeled data (let's call it dataset D1) to build a random forest classification model and it performs well.

Now my main interest is to apply this model on another dataset D2 which has zero labels, meaning I cannot use it for training. The only way to measure performance on D2 is to check the proportions of classes predicted from it.

Problem: D2 is skewed compared to D1 (features don't have the same mean or fit the same distribution). Because of this, the model applied to D2 gives heavily skewed results towards one class. I know this is normal because the majority of D2 is similar to a small subset of D1.

But is there any way to correct that skewness? I know from the nature of my problem the proportions of classes predicted should be less biased. I've tried normalization but it doesn't really help.

I feel I'm not thinking straight :3

asked by Luks, Jan 07 '16

2 Answers

Interesting question. My answer has three parts.

Disclaimer: There is no free lunch. Hence, you can never be sure without checking performance against real test-set labels. In the worst case you have a concept drift in your problem which makes it impossible to predict your target class. However, there are approaches which can provide pretty good results.

For notation:

The features are denoted by X, the target variable by Y, and the learned classifier by f(X) |-> Y. The distribution of X in D1 is written P(X|D1) (abusing notation a bit).

Class distribution in the test set

You "postulated that one could use the distribution in the predicted variables ("check the proportions of classes predicted from it."). This however can merely be an indication. I'm building classifiers in industry to predict that a machine will fail (predictive maintenance). There are a lot of engineers trying to make my input data skew, this is making the machines producing the data more reliable. However, this is not a problem, as one class basically disappears. However, the classifiers are still valid.

There is a very simple answer to the question of how to "fix" the distribution of target labels on the test set. The idea is basically to classify all test instances according to the predicted labels and then sample (with replacement) the data points in accordance with the desired target-variable distribution. You could then try to check the distribution of the features X, but this wouldn't tell you too much.
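A minimal sketch of that resampling step in R; model, d2, and the desired proportions below are hypothetical names standing in for your trained classifier, the unlabelled data, and the distribution you expect:

# predicted labels for D2 from the already-trained classifier
pred_d2 <- predict(model, d2)

# desired class proportions (assumed known from domain knowledge)
desired <- c(classA = 0.5, classB = 0.3, classC = 0.2)

# resample D2, with replacement, so the predicted-class proportions match
n <- nrow(d2)
idx <- unlist(lapply(names(desired), function(cl) {
  pool <- which(pred_d2 == cl)   # instances predicted as class cl
  pool[sample(seq_along(pool), size = round(desired[cl] * n), replace = TRUE)]
}))
d2_resampled <- d2[idx, ]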

Can the skewness be a problem? Indeed it can, as a classifier typically tries to maximise accuracy, the F1 measure, or some other statistical property. If you knew the distribution on D2 in advance, you could provide a cost function which minimises the expected cost under that distribution. These costs can be used to resample the training data, as mentioned in the other answer; some learning algorithms also have more elaborate techniques to incorporate this information.
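If you do know (or can estimate) the class distribution on D2, a random forest in R lets you feed it in either as class priors or via stratified resampling. A sketch, assuming d1 is the labelled training frame with a target column y (hypothetical names):

library(randomForest)

# expected class distribution on D2 (assumed known);
# the order must match levels(d1$y)
p_d2 <- c(classA = 0.5, classB = 0.3, classC = 0.2)

# option 1: pass the expected proportions as class priors
rf_w <- randomForest(y ~ ., data = d1, classwt = p_d2)

# option 2: stratified resampling of the training data towards those proportions
rf_s <- randomForest(y ~ ., data = d1, strata = d1$y,
                     sampsize = round(1000 * p_d2))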

Outlier Detection

One question is whether you can detect that something has changed in the inputs X. This is pretty important, as it can indicate that you are feeding the model the wrong data. You can apply fairly simple tests, for example comparing the mean and distribution of every dimension. However, this ignores dependencies between the variables.
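A simple per-dimension check could look like this in R, assuming x1 and x2 are data frames holding only the numeric feature columns of D1 and D2 (hypothetical names, a sketch rather than a full drift detector):

# two-sample Kolmogorov-Smirnov test per feature: small p-values flag
# features whose marginal distribution differs between D1 and D2
drift_p <- sapply(names(x1), function(col) ks.test(x1[[col]], x2[[col]])$p.value)
drift_p[drift_p < 0.01]

# note: this looks at every dimension in isolation and, as said above,
# ignores dependencies between the variables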

For the following two illustrations I'm using the iris dataset: [scatter plot of Petal.Length against Petal.Width, coloured by species]

Two techniques jump to my mind which allow you to detect that something in the data has changed. The first relies on the PCA transformation (only for numerical features, but there are similar ideas for categorical ones). PCA allows you to transform your input data into a lower-dimensional space: PCA(X, t) = PCA([X1, ..., Xn], t) = [Cond1, ..., Condm] = Cond, with a projection t and typically m << n. This transformation is still (approximately) reversible, such that PCA^-1(Cond, t) = X' and the error MSE(X, X') is small. To detect a problem you can monitor this error; once it increases, you should mistrust your predictions.
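A sketch of that monitoring idea with prcomp, again assuming x1 and x2 are numeric feature frames for D1 and D2 (hypothetical names):

# fit the projection on the training features only; tol drops small components
pca <- prcomp(x1, retx = TRUE, scale. = TRUE, tol = 0.2)

# helper: project data and invert the transformation back to the original space
reconstruct <- function(pca, x) {
  proj <- predict(pca, x)
  t(t(proj %*% t(pca$rotation)) * pca$scale + pca$center)
}

# the reconstruction error on D1 sets the threshold; D2 instances above it
# do not live in the subspace learned from D1
err_d1 <- rowMeans((as.matrix(x1) - reconstruct(pca, x1))^2)
err_d2 <- rowMeans((as.matrix(x2) - reconstruct(pca, x2))^2)
suspicious <- which(err_d2 > quantile(err_d1, 0.99))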

If I build a PCA on all data from versicolor and virginica and plot the reconstruction error in two dimensions (the PCA itself is fitted on all four iris dimensions), I get:

[reconstruction-error plot]

However, if versicolor is the new data, the results are less convincing:

[reconstruction-error plot with versicolor as the new data]

However, a PCA (or something similar) is usually computed for numerical data anyway, hence it can give a good indication without much overhead.

A second technique I'm aware of is based on so-called one-class Support Vector Machines. Where a normal support vector machine builds a classifier which tries to separate two target classes Y, a one-class support vector machine tries to separate seen from unseen data. Using this technique is fairly attractive if you already use a support vector machine for classification: you basically get two classifications, the first giving the target class and the second saying whether similar data has been seen before.
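A sketch of that second classifier with e1071, once more assuming x1 and x2 hold the numeric feature columns of D1 and D2 (hypothetical names):

library(e1071)

# train the "have I seen data like this before?" model on D1 only
ocl <- svm(x1, type = "one-classification", nu = 0.05)

# TRUE = looks like the training data, FALSE = novel
seen <- predict(ocl, x2)
mean(!seen)   # fraction of D2 the main classifier has essentially never seen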

If I build a one-class classifier on setosa and virginica and colour by novelty, I get the following graph:

[one-class SVM plot, points coloured by novelty]

As can be seen, data from versicolor appears quite suspicious; in this case it is a new class. However, if we assumed these were instances of virginica, they would be drifting dangerously close to the hyperplane.

Semi-supervised and transductive learning

To solve your underlying problem, the idea of transductive learning, a special case of semi-supervised learning, might be interesting. In semi-supervised learning the training set consists of two parts: the labelled data and the unlabelled data. Semi-supervised learning uses all of this data to build the classifier. Transductive learning is the special case where the unlabelled data is your test data D2. The idea goes back to Vapnik: "don't try to solve a more complicated problem [building a classifier for all possible data] when you want to solve a simpler problem [predicting labels for D2]".
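A crude way to act on that idea without a dedicated transductive learner is self-training: pseudo-label D2 with the current model, keep only the confident predictions, and retrain. This is only a rough proxy for transduction, sketched here with randomForest and the hypothetical frames d1 (with target column y) and d2:

library(randomForest)

rf <- randomForest(y ~ ., data = d1)

# pseudo-label D2 and keep only the confident predictions
prob <- predict(rf, d2, type = "prob")
pred <- predict(rf, d2)
keep <- apply(prob, 1, max) > 0.9

pseudo   <- d2[keep, ]
pseudo$y <- pred[keep]

# retrain on the labelled data plus the confidently pseudo-labelled part of D2
rf2 <- randomForest(y ~ ., data = rbind(d1, pseudo))

Note that when D2 is strongly skewed, self-training can also reinforce that skew, so keep an eye on the resulting class proportions.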

APPENDIX

R code for the plots

library(ggplot2)
library(e1071)

# scatter plot of the raw iris data, coloured by species
ggplot(iris) + aes(x = Petal.Width, y = Petal.Length, color = Species) +
  geom_point() + stat_ellipse()

# one-class SVM trained on setosa and virginica only (petal dimensions)
ocl <- svm(iris[iris$Species %in% c("virginica", "setosa"), 3:4],
           type = "one-classification")
coloring <- predict(ocl, iris[, 3:4], decision.values = TRUE)

# colour by the novelty decision (TRUE = seen before, FALSE = novel)
ggplot(iris) + aes(x = Petal.Width, y = Petal.Length, color = coloring) +
  geom_point() + stat_ellipse()

# colour continuously by the decision value (distance to the hyperplane)
ggplot(iris) + aes(x = Petal.Width, y = Petal.Length) +
  geom_point(color = rgb(red = 0.8 + 0.1 * attr(coloring, "decision.values"),
                         green = rep(0, 150),
                         blue = 1 - (0.8 + 0.1 * attr(coloring, "decision.values"))))

# PCA reconstruction error: fit the PCA on two species, reconstruct all of iris
# (pick one of the two fits below; the later assignment overrides the earlier one)
pca <- prcomp(iris[iris$Species %in% c("virginica", "setosa"), 1:4],
              retx = TRUE, scale. = TRUE, tol = 0.2)
pca <- prcomp(iris[iris$Species %in% c("virginica", "versicolor"), 1:4],
              retx = TRUE, scale. = TRUE, tol = 0.4)
predicted <- predict(pca, iris[, 1:4])
inverted  <- t(t(predicted %*% t(pca$rotation)) * pca$scale + pca$center)

# plot the reconstruction error in the petal dimensions, coloured by species
ggplot(inverted[, 3:4] - iris[, 3:4]) +
  aes(x = Petal.Width, y = Petal.Length, color = iris$Species) +
  geom_point() + stat_ellipse()

answered by CAFEBABE

A number of factors could lead to this skewed outcome:

You seem to indicate that D2 is skewed compared to D1, so the heavily skewed results may simply be the expected outcome (perhaps the D2 dataset is focused on a region of the problem space where one class is dominant). Depending on the nature of the data, this could be a valid result.

Perhaps the model is overtrained on a particular class. You could try training on fewer cases of that class to encourage classification towards the other classes and see how the outcome changes. I don't know how many training or testing cases you have, but if the set is large and one class has many more labels in the training data than the others, this could be leading to over-classification.

Perhaps you could also manipulate the training data to be closer to the means of D2 and see what impact that has on classification. I've never tried this before, though.

I hope this helps in some way.

answered by Matthew Spencer