I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and train a simple NN on top of these features.
The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., that are automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should achieve a high accuracy score.
Input:
def 1
for 2
in 2
True 1
): 3
,: 1
...
Output: python
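For illustration, here's a rough sketch of how such an encoding could be computed. The tokenizer regex and the vocab list below are made up for the example and are not taken from the actual project:
import re
from collections import Counter

def bag_of_tokens(snippet, vocab):
    # Count how many times each vocabulary token occurs in the snippet
    counts = Counter(re.findall(r'[A-Za-z_#]+|[^\sA-Za-z_]+', snippet))
    return [counts[token] for token in vocab]

vocab = ['def', 'self', 'function', '->', 'const', '#include']  # normally extracted from the corpus
features = bag_of_tokens('def f(x):\n    return x * 2', vocab)
# >>> [1, 0, 0, 0, 0, 0]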
I got 99% accuracy pretty quickly and decided that was a sign it worked just as expected. Here's the model (a full runnable script is here):
import tensorflow as tf

# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')
# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg,
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')
# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')
# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))
# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
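The training loop is omitted above. A minimal sketch of what it might look like; next_batch and num_iterations are hypothetical placeholders, and the optimizer choice is an assumption, not the original script's:
reg_loss = tf.losses.get_regularization_loss()           # collect the L2 penalties
train_op = tf.train.AdamOptimizer(0.001).minimize(loss + reg_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1, num_iterations + 1):
        batch_x, batch_y = next_batch()                   # hypothetical data helper
        _, l, acc = sess.run([train_op, loss, accuracy],
                             feed_dict={x: batch_x, y: batch_y, training: True})
        if i % 5 == 0:
            print('iteration=%d loss=%.3f train-acc=%.5f' % (i, l, acc))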
The output is pretty encouraging:
iteration=5 loss=2.580 train-acc=0.34277
iteration=10 loss=2.029 train-acc=0.69434
iteration=15 loss=2.054 train-acc=0.92383
iteration=20 loss=1.934 train-acc=0.98926
iteration=25 loss=1.942 train-acc=0.99609
Files.VAL mean accuracy = 0.99121 <-- After just 1 epoch!
iteration=30 loss=1.943 train-acc=0.99414
iteration=35 loss=1.947 train-acc=0.99512
iteration=40 loss=1.946 train-acc=0.99707
iteration=45 loss=1.946 train-acc=0.99609
iteration=50 loss=1.944 train-acc=0.99902
iteration=55 loss=1.946 train-acc=0.99902
Files.VAL mean accuracy = 0.99414
Test accuracy was also around 1.0. Everything looked perfect.
But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before softmax, because they simply indicate classes with low probability. Clamping them at zero only makes those classes artificially more probable, which would be a mistake. Getting rid of the ReLU should only make the model more robust and confident in the correct class.
That's what I thought.
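Here's the intuition in numbers (a toy example, the values are made up):
with tf.Session() as sess:
    toy_logits = tf.constant([2.0, -3.0, -1.0])
    print(sess.run(tf.nn.softmax(toy_logits)))
    # >>> roughly [0.95 0.006 0.05]: negative logits correctly mark unlikely classes
    print(sess.run(tf.nn.softmax(tf.nn.relu(toy_logits))))
    # >>> roughly [0.79 0.11 0.11]: clipping them to zero inflates their probability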
So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:
iteration=5 loss=5.236 train-acc=0.16602
iteration=10 loss=4.068 train-acc=0.18750
iteration=15 loss=3.110 train-acc=0.37402
iteration=20 loss=5.149 train-acc=0.14844
iteration=25 loss=2.880 train-acc=0.18262
Files.VAL mean accuracy = 0.28711
iteration=30 loss=3.136 train-acc=0.25781
iteration=35 loss=2.916 train-acc=0.22852
iteration=40 loss=2.156 train-acc=0.39062
iteration=45 loss=1.777 train-acc=0.45312
iteration=50 loss=2.726 train-acc=0.33105
Files.VAL mean accuracy = 0.29362
The accuracy got better with training, but never surpassed 91-92%. I changed the activation back and forth several times, varying other parameters as well (layer size, dropout, regularizer, extra layers, anything), and always had the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely achieved 90% after 50 epochs. According to TensorBoard, there was no big difference in the weight distributions: the gradients didn't die out and both models learned normally.
How is this possible? How can the final ReLU make the model so much better, especially if that ReLU is a bug?
After playing around with it for a while, I decided to visualize the actual prediction distribution for both models:
predicted_distribution = tf.nn.softmax(logits, name='distribution')
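The histograms below come from TensorBoard; recording them takes a summary op along these lines (assuming the usual summary-writer plumbing, which isn't shown here):
tf.summary.histogram('distribution', predicted_distribution)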
Below are the histograms of the predicted distributions and how they evolved over time.
[Histogram: with ReLU (the buggy model)]
[Histogram: without ReLU (the correct model)]
The histogram of the correct model makes sense: most of the probabilities are close to 0. But the histogram of the ReLU model is suspicious: the values seem to concentrate around 0.14 after a few iterations. Printing the actual predictions confirmed this:
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
I had 7 classes (one for each of the 7 languages at that moment), and 0.14286 is exactly 1/7. It turns out, the "perfect" model had learned to output all-zero logits, which in turn translate into a uniform prediction.
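This is easy to verify: all-zero logits pass through softmax into an exactly uniform distribution.
with tf.Session() as sess:
    print(sess.run(tf.nn.softmax(tf.zeros([1, 7]))))
    # >>> seven values of ~0.14286, i.e. 1/7 for each class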
But how can this distribution be reported as 99% accurate?
tf.nn.in_top_k
Before diving into tf.nn.in_top_k, I checked an alternative way to compute accuracy:
true_correct = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
alternative_accuracy = tf.reduce_mean(tf.cast(true_correct, tf.float32))
... which performs an honest comparison of the highest-scoring predicted class and the ground truth. The result is this:
iteration=2 loss=3.992 train-acc=0.13086 train-alt-acc=0.13086
iteration=4 loss=3.590 train-acc=0.13086 train-alt-acc=0.12207
iteration=6 loss=2.871 train-acc=0.21777 train-alt-acc=0.13672
iteration=8 loss=2.466 train-acc=0.37695 train-alt-acc=0.16211
iteration=10 loss=2.099 train-acc=0.62305 train-alt-acc=0.10742
iteration=12 loss=2.066 train-acc=0.79980 train-alt-acc=0.17090
iteration=14 loss=2.016 train-acc=0.84277 train-alt-acc=0.17285
iteration=16 loss=1.954 train-acc=0.91309 train-alt-acc=0.13574
iteration=18 loss=1.956 train-acc=0.95508 train-alt-acc=0.06445
iteration=20 loss=1.923 train-acc=0.97754 train-alt-acc=0.11328
Indeed, tf.nn.in_top_k with k=1 quickly diverged from the true accuracy and began to report fantasy 99% values.
So what does it actually do? Here's what the documentation says about it:
Says whether the targets are in the top K predictions.
This outputs a batch_size bool array, an entry out[i] is true if the prediction for the target class is among the top k predictions among all predictions for example i. Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.
That's what it is: if the probabilities are uniform (which actually means "I have no idea"), all of them count as correct. The situation is even worse, because if the logits distribution is almost uniform, softmax may transform it into an exactly uniform distribution, as can be seen in this simple example:
x = tf.constant([0, 1e-8, 1e-8, 1e-9])
tf.nn.softmax(x).eval()
# >>> array([0.25, 0.25, 0.25, 0.25], dtype=float32)
... which means that every nearly uniform prediction may be considered "correct" according to the tf.nn.in_top_k spec.
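A quick check (with made-up values) confirms that tf.nn.in_top_k with k=1 accepts any label once the predictions are uniform:
with tf.Session() as sess:
    uniform = tf.constant([[0.25, 0.25, 0.25, 0.25],
                           [0.25, 0.25, 0.25, 0.25]])
    labels = tf.constant([0, 3])   # arbitrary ground truth
    print(sess.run(tf.nn.in_top_k(uniform, labels, 1)))
    # >>> [ True  True ] -- every label is "in the top 1" because of the ties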
tf.nn.in_top_k is a dangerous choice of accuracy measure in TensorFlow, because it may silently swallow wrong predictions and report them as "correct". Instead, you should always use this long but trusted expression:
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64)), tf.float32))
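As a sanity check (a toy setup, not the project's evaluation code), on all-zero logits this expression reports chance-level accuracy, while the in_top_k-based one claims a perfect score:
with tf.Session() as sess:
    zero_logits = tf.zeros([1000, 7])                         # degenerate model output
    labels = tf.random_uniform([1000], 0, 7, dtype=tf.int32)  # random ground truth
    honest = tf.reduce_mean(tf.cast(
        tf.equal(tf.argmax(zero_logits, 1), tf.cast(labels, tf.int64)), tf.float32))
    top_k_based = tf.reduce_mean(tf.cast(tf.nn.in_top_k(zero_logits, labels, 1), tf.float32))
    print(sess.run([honest, top_k_based]))
    # >>> roughly [0.14, 1.0]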