
Is it acceptable to have the same input multiple times in machine learning (with different output)?

I was wondering whether in machine learning it is acceptable to have a dataset that may contain the same input multiple times, but each time with another (valid!) output. For instance in the case of machine translation, an input sentence but each time given a different translation.

On the one hand I would say that this is definitely acceptable, because the differences in output might better model small latent features, leading to better generalisation capabilities of the model. On the other, I fear that having the same input multiple times would bias the model for that given input - meaning that the first layers (in a deep neural network) might be "overfitted" on this input. Specifically, this can be tricky when the same input is seen multiple times in the test set, but never in the training set or vice-versa.

Bram Vanroy asked Nov 07 '22

1 Answer

In general you can do whatever works, and this "whatever works" is also the key to answering your question. The first thing you need to do is define a performance metric. If the function to be learned is defined as X |-> Y, where X is the source sentence and Y is the target sentence, then the performance measure is a function f((x, y)) -> |R, which in turn can be used to define the loss function that the neural network optimises.

Let's assume for simplicity that you use accuracy, i.e. the fraction of perfectly matched sentences. If you have conflicting examples like (x, y1) and (x, y2), then you can no longer reach 100% accuracy, which feels weird but does no harm. The other important fact is that each example can by definition only be matched correctly once -- assuming no random component in the predictions of your NN. This means that sentences with more alternative translations are not weighted higher when building the model. The advantage is that this approach might give slightly better generalisation. On the downside, it might cause a plateau in the loss during optimisation, which can leave the model stuck between the optimal choices.
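To make the cap on accuracy concrete, here is a minimal sketch (the toy sentences and the dictionary-based "model" are made up for illustration): a deterministic model can only output one translation per input, so with two conflicting targets for the same source, at most one of the two pairs can ever be matched.

```python
def exact_match_accuracy(model, dataset):
    """Fraction of (source, target) pairs where model(source) == target."""
    hits = sum(1 for x, y in dataset if model(x) == y)
    return hits / len(dataset)

# Two valid translations for the same source sentence (a conflicting duplicate).
dataset = [
    ("ik hou van katten", "I love cats"),
    ("ik hou van katten", "I like cats"),
    ("de hond slaapt", "the dog sleeps"),
]

# A deterministic toy "model": each source maps to exactly one translation,
# so it can match at most one of the two conflicting targets.
model = {"ik hou van katten": "I love cats",
         "de hond slaapt": "the dog sleeps"}.get

print(exact_match_accuracy(model, dataset))  # 2/3, never 1.0
```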

A much cleaner approach would be to take the fact that there are alternative translations into account in the definition of your performance measure/loss. You can define the performance metric as

\frac{1}{|D|}\sum_{(x,[y_1,\dots,y_n])\in D} 1I_{f(x)\in[y_1,\dots,y_n]}

Where 1I is the indicator function.

This gives a cleaner metric. Obviously you need to adapt the above derivation to your target metric.
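The metric above can be sketched directly in code (again with made-up toy data): the alternative translations for each source are grouped into a reference list, and a prediction counts as correct if it matches any of them.

```python
def multi_reference_accuracy(model, dataset):
    """(1/|D|) * sum over (x, [y_1,...,y_n]) of 1I_{model(x) in [y_1,...,y_n]}."""
    hits = sum(1 for x, refs in dataset if model(x) in refs)
    return hits / len(dataset)

# Alternative translations are grouped per source sentence.
dataset = [
    ("ik hou van katten", ["I love cats", "I like cats"]),
    ("de hond slaapt", ["the dog sleeps"]),
]

model = {"ik hou van katten": "I like cats",
         "de hond slaapt": "the dog sleeps"}.get

print(multi_reference_accuracy(model, dataset))  # 1.0
```

With this formulation the conflicting duplicates no longer cap the metric: either valid translation of "ik hou van katten" scores as a hit.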

CAFEBABE answered Nov 15 '22