Keras Binary Classification - Sigmoid activation function

I've implemented a basic MLP in Keras with TensorFlow and I'm trying to solve a binary classification problem. For binary classification, sigmoid seems to be the recommended activation function, but I don't quite understand why, or how Keras deals with this.

I understand that the sigmoid function produces values in the range between 0 and 1. My understanding is that for classification problems using sigmoid, a threshold (typically 0.5) is used to determine the class of an input. In Keras, I don't see any way to specify this threshold, so I assume it's done implicitly in the back-end? If so, how does Keras distinguish between the use of sigmoid in a binary classification problem and in a regression problem? With binary classification we want a binary value, but with regression a continuous value is needed. The only thing I can see that could indicate this is the loss function. Is that what tells Keras how to handle the data?

Additionally, assuming Keras is implicitly applying a threshold, why does it output continuous values when I use my model to predict on new data?

For example:

y_pred = model.predict(x_test)
print(y_pred)

gives:

[7.4706882e-02]
[8.3481872e-01]
[2.9314638e-04]
[5.2297767e-03]
[2.1608515e-01]
...
[4.4894204e-03]
[5.1120580e-05]
[7.0263929e-04]

I can apply a threshold myself when predicting to get a binary output, but surely Keras must be doing that anyway in order to classify correctly? Perhaps Keras applies a threshold when training the model, but when I use it to predict new values the threshold isn't used, because the loss function isn't used in predicting? Or is it not applying a threshold at all, and the continuous values it outputs just happen to work well with my model? I've checked that this also happens with the Keras example for binary classification, so I don't think I've made any errors in my code, especially as it's predicting accurately.

If anyone could explain how this is working, I would greatly appreciate it.

Here's my model as a point of reference:

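from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(Dense(124, activation='relu', input_shape=(2,)))
model.add(Dropout(0.5))
model.add(Dense(124, activation='relu'))
model.add(Dropout(0.1))
# A single sigmoid unit: the output is a value between 0 and 1
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(loss='binary_crossentropy',
              # `learning_rate` was spelled `lr` in older Keras releases
              optimizer=SGD(learning_rate=0.1, momentum=0.003),
              metrics=['acc'])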

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
Daniel Whettam asked Mar 06 '18 16:03





1 Answer

The output of a binary classification is the probability of a sample belonging to a class.
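
To make that concrete: model.predict returns these probabilities as they are, and nothing in the forward pass thresholds them. A hard 0/1 label only appears when something compares the probability against a cut-off, which is what the binary accuracy metric in Keras does (0.5 by default) and what you can do yourself after predicting. A minimal plain-Python sketch of that step, using the first five values from the question's output; to_labels and accuracy are hypothetical helper names rather than Keras functions, and the y_true labels are made up purely for illustration:

```python
# First five probabilities copied from the question's printed output.
y_prob = [7.4706882e-02, 8.3481872e-01, 2.9314638e-04, 5.2297767e-03, 2.1608515e-01]

# Illustrative ground-truth labels (an assumption, not from the question).
y_true = [0, 1, 0, 0, 1]


def to_labels(probs, threshold=0.5):
    """Turn sigmoid outputs into hard 0/1 labels using a caller-chosen cut-off."""
    return [1 if p > threshold else 0 for p in probs]


def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the hard labels agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


labels = to_labels(y_prob)
print(labels)                    # [0, 1, 0, 0, 0]
print(accuracy(y_true, labels))  # thresholding happens here, not in the model
```

With the NumPy array that model.predict returns, the same step is usually written as (y_pred > 0.5).astype(int).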

how is Keras distinguishing between the use of sigmoid in a binary classification problem, or a regression problem?

It does not need to. It uses the loss function to calculate the loss, then computes the derivatives and updates the weights.

In other words:

  • During training the framework minimizes the loss. The user must specify a loss function provided by the framework or supply their own. The framework only cares about the scalar value this function returns; its two arguments are the predicted ŷ and the actual y.
  • Each activation function implements forward propagation and back-propagation. The framework is only interested in these two functions and does not care what the activation computes. For back-propagation to work the activation must be differentiable (at least almost everywhere); a non-linear activation is what lets stacked layers model more than a linear map, but the framework does not enforce it.
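
The two points above can be sketched in plain Python. This is a rough illustration under stated assumptions, not how Keras implements it internally: the function names are hypothetical, the clipping constant eps is an arbitrary choice, and the one-weight model p = sigmoid(w * x) exists only to show a single update. Note that the loss is computed from the raw probabilities, so no threshold is involved anywhere in training.

```python
import math


def sigmoid(z):
    """Forward pass of the activation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the activation, used in back-propagation."""
    s = sigmoid(z)
    return s * (1.0 - s)


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy: two arguments in, one scalar out."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Both predictions give the same labels at a 0.5 cut-off,
# but the loss still prefers the more confident one.
y_true = [0, 1]
loss_confident = binary_crossentropy(y_true, [0.1, 0.9])
loss_hesitant = binary_crossentropy(y_true, [0.4, 0.6])

# One gradient-descent step for the toy model p = sigmoid(w * x).
x, y, w, learning_rate = 2.0, 1, 0.0, 0.1
p = sigmoid(w * x)
loss_before = binary_crossentropy([y], [p])
dloss_dp = -(y / p) + (1 - y) / (1 - p)       # derivative of the loss
grad_w = dloss_dp * sigmoid_grad(w * x) * x   # chain rule through sigmoid
w_new = w - learning_rate * grad_w            # weight update
loss_after = binary_crossentropy([y], [sigmoid(w_new * x)])

print(loss_confident < loss_hesitant)  # True
print(loss_after < loss_before)        # True
```

The same machinery works whether the target is a 0/1 label or a continuous value between 0 and 1; only the choice of loss function changes what the weights are pushed towards.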
Maxim Egorushkin answered Sep 26 '22 22:09
