Limit neural network output to subset of trained classes

Is it possible to pass a vector to a trained neural network so that it only chooses from a subset of the classes it was trained to recognize? For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (such as images of serial numbers), so I pass it a vector telling it not to guess any lowercase letters. Since the classes are mutually exclusive, the network ends in a softmax function. Below are examples of what I thought of trying, but none of them really work.

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

# Stand-ins for the previous layer / NN output and the vector of allowed answers.
output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
       0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])

#Ideas -----

'''First: Multiply by the restrictions before sending it through the softmax.
I stupidly tried this one.'''
results = softmax(output*restrictions)

'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results*restrictions

'''Third: Remove invalid entries before calculating the softmax.'''
result = output*restrictions
result[result != 0] = softmax(result[result != 0])

All of these have issues. The first one causes invalid choices to default to:

1/np.exp(arr).sum()

Since inputs to the softmax can be negative, this can raise the probability given to an invalid choice and make the answer worse. (I should've looked into it before I tried it.)
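To make the failure concrete, here is a minimal sketch (the numbers are made up for illustration): an invalid class with a negative logit actually gains probability when it is zeroed out, because exp(0) = 1 is larger than exp of any negative number.

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

# An invalid class with a negative logit gains probability when zeroed,
# because exp(0) = 1 > exp(-2.0).
output = np.array([2.0, -2.0, 1.0])
restrictions = np.array([1, 0, 1])

print(softmax(output))                 # [0.721 0.013 0.265] -> invalid class ~1%
print(softmax(output * restrictions))  # [0.665 0.090 0.245] -> invalid class ~9%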

The second and third both have a similar issue in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l but starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on its way to giving the output 1 with .80 probability, but then this option is removed, it seems the remaining options will redistribute and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous. An example of what I'm trying to say:

output
Out[75]: array([ 5.39413513,  3.81445419,  3.75369546,  1.02716988,  0.39189373])

softmax(output)
Out[76]: array([ 0.70454877,  0.14516581,  0.13660832,  0.00894051,  0.00473658])

softmax(output[1:])
Out[77]: array([ 0.49133596,  0.46237183,  0.03026052,  0.01603169])

(Arrays were ordered to make it easier.) In the original output the softmax gives .70 that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution now assigns the 4 remaining options under 50% probability each, which could easily be ignored as too low to use.

I've considered passing a vector into the network earlier as another input, but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase the time required to train.

EDIT: I was writing way too much in the comments so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one-hot encoded answer and randomly added extra enabled classes to simulate an answer key, ensuring the correct answer was always in the key. When the key had very few enabled categories the network relied heavily on it, and it interfered with learning features from the image. When the key had a lot of enabled categories it seemingly ignored the key completely. This could have been a problem that needed optimization, an issue with my network architecture, or just a needed tweak to training, but I never got around to the solution.
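For reference, a rough sketch of what that experiment looked like (layer sizes, names, and the key-generation scheme are illustrative stand-ins, not my actual code):

import torch

# The restriction key is concatenated with the image features before the
# classifier head; sizes and names here are placeholders.
class KeyConditionedClassifier(torch.nn.Module):
    def __init__(self, feature_dim=512, num_classes=3000):
        super().__init__()
        self.head = torch.nn.Linear(feature_dim + num_classes, num_classes)

    def forward(self, image_features, key):
        # key: float tensor of 0/1 flags, one entry per class
        return self.head(torch.cat([image_features, key], dim=1))

def random_key(one_hot_target, extra=10):
    # Simulate an answer key: the true class plus some randomly enabled classes.
    key = one_hot_target.clone()
    noise_idx = torch.randint(0, key.shape[1], (key.shape[0], extra))
    key.scatter_(1, noise_idx, 1.0)
    return key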

I did find out that removing answers and zeroing them out become almost the same once I set the invalid logits to -np.inf (rather than multiplying them by 0). I was aware of ensembles, but as mentioned in a comment to the first response, my network was dealing with CJK characters (the alphabet was just to make the example easier) and had 3000+ classes. The network was already overly bulky, which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of, but 3000+ networks seems problematic too (if I understood what you were saying correctly), though I may look into it later.
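A minimal numpy sketch of that -np.inf masking (values reused from the example above): since np.exp(-np.inf) is exactly 0, disallowed classes get zero probability and the result matches dropping the entries before the softmax.

import numpy as np

def softmax(arr):
    return np.exp(arr)/np.exp(arr).sum()

output = np.array([5.394, 3.814, 3.754, 1.027, 0.392])
restrictions = np.array([0, 1, 1, 1, 1])

# Mask disallowed logits with -inf, then softmax as usual.
masked = np.where(restrictions == 1, output, -np.inf)
print(softmax(masked))                     # [0.    0.491 0.462 0.03  0.016]
# Identical (on the allowed entries) to removing them first:
print(softmax(output[restrictions == 1]))  # [0.491 0.462 0.03  0.016]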

asked May 24 '17 by David S.

1 Answer

First of all, I will loosely go through the options you have listed and add some viable alternatives, with their pros and cons. It's kinda hard to structure this answer, but I hope you'll get what I'm trying to put out:

1. Multiplying by the restrictions before sending it through softmax

This obviously may give a higher chance to the zeroed-out entries, as you have written, and seems like a flawed approach from the start.

Alternative: replace impossible values with the smallest logit value. This is similar to softmax(output[1:]), though the network will be even more uncertain about the results. An example PyTorch implementation:

import torch

logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
# Replace the impossible class (index 0 here) with the smallest logit.
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits, dim=0))

which yields:

tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
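The snippet above handles a single known-invalid index; a hedged generalization to an arbitrary restriction mask (names here are illustrative) could look like:

import torch

def restricted_softmax(logits, allowed):
    # allowed: bool tensor, True for classes that may be predicted.
    # Replace disallowed logits with the smallest allowed logit, as above.
    floor = logits[allowed].min()
    return torch.nn.functional.softmax(torch.where(allowed, logits, floor), dim=0)

logits = torch.tensor([5.3941, 3.8145, 3.7537, 1.0272, 0.3919])
allowed = torch.tensor([False, True, True, True, True])
print(restricted_softmax(logits, allowed))  # tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])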

Discussion

  • Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0], but if that's an invalid answer and thus removed, the redistribution now assigns the 4 remaining options under 50% probability each, which could easily be ignored as too low to use."

Yes, and you would be right in doing so. Even more so, the probability of the remaining best class under the original softmax is far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and its output distribution), rendering some part of your computations pointless. This points to another problem, stated in the bounty this time:

2. NNs are known to be overconfident in classification problems

I can imagine this being solved in multiple ways:

2.1 Ensemble

Create multiple neural networks and ensemble them by summing their logits and taking the argmax at the end (or applying softmax and then argmax). A hypothetical situation with 3 different models making different predictions:

import torch

predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])

combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits, dim=0))

This gives us the following probabilities after the softmax:

[0.11291057 0.7576356 0.1293983 0.00005554 0.]

(notice the first class is now the most probable)

You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classification decision surface smoother and fixes mutual errors between classifiers (given that their predictions vary quite a lot). It would take many posts to describe it in any greater detail (or a separate question with a specific problem would be needed); here or here are some that might get you started.
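For a rough idea, a minimal bootstrap-aggregating sketch (model construction and the training loop are omitted; names are placeholders):

import torch

def bootstrap_indices(n_samples, n_models):
    # Each model trains on a resample-with-replacement of the training set.
    return [torch.randint(0, n_samples, (n_samples,)) for _ in range(n_models)]

def ensemble_logits(models, x):
    # Sum the logits of all models, as in the example above.
    return torch.stack([model(x) for model in models]).sum(dim=0)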

Still, I would not mix this approach with manual selection of outputs.

2.2 Transform the problem into binary

This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.

Basically, each of your classes can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells you whether the network thinks this example belongs to its class or not.

If you are sure a certain class cannot be the outcome, you simply don't run the network responsible for detecting that class. After obtaining predictions from all the networks (or a subset of them), you choose the highest value (or the highest probability if you use a sigmoid activation, though that would be computationally wasteful).

An additional benefit would be the simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior when needed, as sketched below.
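A toy sketch of the idea (the linear stand-ins and sizes are purely illustrative; real networks would replace them):

import torch

# One tiny binary network per class, each emitting a single logit.
binary_nets = {c: torch.nn.Linear(512, 1) for c in range(3000)}  # stand-ins

def predict(features, allowed_classes):
    # Only run the networks for classes that are actually possible.
    scores = {c: binary_nets[c](features).item() for c in allowed_classes}
    return max(scores, key=scores.get)  # class with the highest logit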

Conclusions

If I were you, I would go with the approach outlined in 2.2, as it could easily save you some inference time and would allow you to "choose outputs" in a sensible manner.

If this approach is not enough, you may consider N ensembles of networks (a mix of 2.2 and 2.1) with bootstrap or other ensembling techniques. This should improve your accuracy as well.

answered Sep 22 '22 by Szymon Maszke