I've implemented a basic MLP in Keras with TensorFlow, and I'm trying to solve a binary classification problem. For binary classification, sigmoid seems to be the recommended activation function, and I don't quite understand why, or how Keras deals with it.
I understand that the sigmoid function produces values in the range between 0 and 1. My understanding is that, for classification problems using sigmoid, a certain threshold is used to determine the class of an input (typically 0.5). In Keras I don't see any way to specify this threshold, so I assume it's done implicitly in the back-end? If that's the case, how does Keras distinguish between the use of sigmoid in a binary classification problem and in a regression problem? With binary classification we want a binary output, but with regression a continuous value is needed. All I can see that could indicate this is the loss function. Is that what informs Keras how to handle the data?
Additionally, assuming Keras is implicitly applying a threshold, why does my model output continuous values when I use it to predict on new data?
For example:
y_pred = model.predict(x_test)
print(y_pred)
gives:
[7.4706882e-02]
[8.3481872e-01]
[2.9314638e-04]
[5.2297767e-03]
[2.1608515e-01]
...
[4.4894204e-03]
[5.1120580e-05]
[7.0263929e-04]
I can apply a threshold myself when predicting to get a binary output; however, surely Keras must be doing that anyway in order to classify correctly? Perhaps Keras applies a threshold when training the model, but not when I use it to predict new values, since the loss function isn't used in prediction? Or is it not applying a threshold at all, and the continuous values output just happen to work well with my model? I've checked that this also happens with the Keras example for binary classification, so I don't think I've made any errors in my code, especially as it's predicting accurately.
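To be concrete, what I mean by applying a threshold myself is something like this (the 0.5 cutoff is just my own choice, not anything Keras provides):

y_pred = model.predict(x_test)           # continuous values in (0, 1)
y_classes = (y_pred > 0.5).astype(int)   # my own hard 0/1 labels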
If anyone could explain how this is working, I would greatly appreciate it.
Here's my model as a point of reference:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(Dense(124, activation='relu', input_shape=(2,)))
model.add(Dropout(0.5))
model.add(Dense(124, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))  # single output in (0, 1)
model.summary()

model.compile(loss='binary_crossentropy',
              optimizer=SGD(learning_rate=0.1, momentum=0.003),
              metrics=['acc'])

# batch_size and epochs are defined earlier
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

score = model.evaluate(x_test, y_test, verbose=0)
Sigmoid is equivalent to a 2-element Softmax, where the second element is assumed to be zero. Therefore, sigmoid is mostly used for binary classification.
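You can verify that equivalence numerically. Here is a quick sketch with hand-rolled sigmoid and softmax helpers (written out by hand for illustration, not Keras functions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

x = 1.7
print(sigmoid(x))                      # ~0.8455
print(softmax(np.array([x, 0.0]))[0])  # same value: softmax over [x, 0]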
Sigmoid is also called the logistic activation function and is closely associated with binary classifiers. Note that it does not output exactly 0 (False) or 1 (True); it smoothly squashes its input into the range between 0 and 1, behaving like a continuous version of a step function.
If there are two mutually exclusive classes (binary classification), your output layer should have a single node with a sigmoid activation function.
For multi-class problems we generally use a softmax activation with cross-entropy loss, because softmax distributes the probability mass across all output nodes. For binary classification, however, a single sigmoid output is used instead of softmax.
The output of a binary classifier is the probability of a sample belonging to the positive class.
"how does Keras distinguish between the use of sigmoid in a binary classification problem and in a regression problem?"

It does not need to. The network just outputs a continuous value; the loss function compares that output with the true label, computes the derivatives, and updates the weights. Neither training nor predict() ever applies a threshold. The only place a cutoff appears is in a metric such as accuracy, which binarizes predictions purely for reporting.
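For example, Keras's binary accuracy metric uses a 0.5 threshold by default. A small sketch with made-up labels to show where that threshold actually lives:

import tensorflow as tf

# BinaryAccuracy binarizes predictions at `threshold` before comparing;
# this is the only place a 0.5 cutoff appears in the pipeline.
metric = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
metric.update_state([0, 1, 1], [0.2, 0.8, 0.4])  # 0.4 falls below 0.5 -> class 0
print(metric.result().numpy())                   # ~0.667: 2 of 3 correct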
In other words, during training the loss only ever sees the raw prediction ŷ and the actual label y.
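For reference, binary cross-entropy for a single sample is -(y·log(ŷ) + (1−y)·log(1−ŷ)). A hand-rolled sketch (not the Keras implementation, though Keras also clips predictions to avoid log(0)):

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # clip predictions away from exactly 0 or 1 so log() stays finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_crossentropy(1.0, 0.83))  # small loss: confident and correct
print(binary_crossentropy(1.0, 0.07))  # large loss: confident and wrong

Notice that a confident wrong prediction is penalized heavily. That gradient signal is all training needs; no threshold is ever required.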