Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does sigmoid & crossentropy of Keras/tensorflow have low precision?

I have the following simple neural network (with 1 neuron only) to test the computation precision of sigmoid activation & binary_crossentropy of Keras:

model = Sequential()
model.add(Dense(1, input_dim=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

To simplify the test, I manually set the only weight to 1 and bias to 0, and then evaluate the model with 2-point training set {(-a, 0), (a, 1)}, i.e.

y = numpy.array([0, 1])
for a in range(40):
    x = numpy.array([-a, a])
    keras_ce[a] = model.evaluate(x, y)[0] # cross-entropy computed by keras/tensorflow
    my_ce[a] = np.log(1+exp(-a)) # My own computation

My Question: I found the binary crossentropy (keras_ce) computed by Keras/Tensorflow reach a floor of 1.09e-7 when a is approx. 16, as illustrated below (blue line). It doesn't decrease further as 'a' keeps growing. Why is that?

enter image description here

This neural network has 1 neuron only whose weight is set to 1 and bias is 0. With the 2-point training set {(-a, 0), (a, 1)}, the binary_crossentropy is just

-1/2 [ log(1 - 1/(1+exp(a)) ) + log( 1/(1+exp(-a)) ) ] = log(1+exp(-a))

So the cross-entropy should decrease as a increases, as illustrated in orange ('my') above. Is there some Keras/Tensorflow/Python setup I can change to increase its precision? Or am I mistaken somewhere? I'd appreciate any suggestions/comments/answers.

like image 255
syeh_106 Avatar asked Sep 01 '18 07:09

syeh_106


People also ask

Why do we use sigmoid function?

The main reason why we use sigmoid function is because it exists between (0 to 1). Therefore, it is especially used for models where we have to predict the probability as an output. Since probability of anything exists only between the range of 0 and 1, sigmoid is the right choice. The function is differentiable.

Why does the sigmoid function use e?

In case of the logistic function in particular, choosing e as the base means that for large negative y we have P(y)≈ey and so the derivative of P(y) is very close to P(y) itself. This makes it simple to contrast logistic growth with unbounded exponential growth y↦a⋅ey.

Why is sigmoid used for probability?

sigmoid(z) will yield a value (a probability) between 0 and 1. Source yes 2 - The "output" must come from a function that satisfies the properties of a distribution function in order for us to interpret it as probabilities. (...) The "sigmoid function" satisfies these properties.

Why do we use sigmoid in logistic regression?

What is the Sigmoid Function? In order to map predicted values to probabilities, we use the Sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use sigmoid to map predictions to probabilities.


2 Answers

TL;DR version: the probability values (i.e. the outputs of sigmoid function) are clipped due to numerical stability when computing the loss function.


If you inspect the source code, you would find that using binary_crossentropy as the loss would result in a call to binary_crossentropy function in losses.py file:

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

which in turn, as you can see, calls the equivalent backend function. In case of using Tensorflow as the backend, that would result in a call to binary_crossentropy function in tensorflow_backend.py file:

def binary_crossentropy(target, output, from_logits=False):
    """ Docstring ..."""

    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)

As you can see from_logits argument is set to False by default. Therefore, the if condition evaluates to true and as a result the values in the output are clipped to the range [epsilon, 1-epislon]. That's why no matter how small or large a probability is, it could not be smaller than epsilon and greater than 1-epsilon. And that explains why the output of binary_crossentropy loss is also bounded.

Now, what is this epsilon here? It is a very small constant which is used for numerical stability (e.g. prevent division by zero or undefined behaviors, etc.). To find out its value you can further inspect the source code and you would find it in the common.py file:

_EPSILON = 1e-7

def epsilon():
    """Returns the value of the fuzz factor used in numeric expressions.
    # Returns
        A float.
    # Example
    ```python
        >>> keras.backend.epsilon()
        1e-07
    ```
    """
    return _EPSILON

If for any reason, you would like more precision you can alternatively set the epsilon value to a smaller constant using set_epsilon function from the backend:

def set_epsilon(e):
    """Sets the value of the fuzz factor used in numeric expressions.
    # Arguments
        e: float. New value of epsilon.
    # Example
    ```python
        >>> from keras import backend as K
        >>> K.epsilon()
        1e-07
        >>> K.set_epsilon(1e-05)
        >>> K.epsilon()
        1e-05
    ```
    """
    global _EPSILON
    _EPSILON = e

However, be aware that setting epsilon to an extremely low positive value or zero, may disrupt the stability of computations all over the Keras.

like image 72
today Avatar answered Sep 19 '22 17:09

today


I think that keras take into account numerical stability, Let's track how keras caculate

First,

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

Then,

def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))


    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)

Notice tf.clip_by_value is used for numerical stability

Let's compare keras binary_crossentropy, tensorflow tf.nn.sigmoid_cross_entropy_with_logits and custom loss function(eleminate vale clipping)

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
import keras

# keras
model = Sequential()
model.add(Dense(units=1, activation='sigmoid', input_shape=(
    1,), weights=[np.ones((1, 1)), np.zeros(1)]))
# print(model.get_weights())
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# tensorflow
G = tf.Graph()
with G.as_default():
    x_holder = tf.placeholder(dtype=tf.float32, shape=(2,))
    y_holder = tf.placeholder(dtype=tf.float32, shape=(2,))
    entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=x_holder, labels=y_holder))
sess = tf.Session(graph=G)


# keras with custom loss function
def customLoss(target, output):
    # if not from_logits:
    #     # transform back to logits
    #     _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
    #     output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
    #     output = tf.log(output / (1 - output))
    output = tf.log(output / (1 - output))
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)
model_m = Sequential()
model_m.add(Dense(units=1, activation='sigmoid', input_shape=(
    1,), weights=[np.ones((1, 1)), np.zeros(1)]))
# print(model.get_weights())
model_m.compile(loss=customLoss,
                optimizer='adam', metrics=['accuracy'])


N = 100
xaxis = np.linspace(10, 20, N)
keras_ce = np.zeros(N)
tf_ce = np.zeros(N)
my_ce = np.zeros(N)
keras_custom = np.zeros(N)

y = np.array([0, 1])
for i, a in enumerate(xaxis):
    x = np.array([-a, a])
    # cross-entropy computed by keras/tensorflow
    keras_ce[i] = model.evaluate(x, y)[0]
    my_ce[i] = np.log(1+np.exp(-a))  # My own computation
    tf_ce[i] = sess.run(entropy, feed_dict={x_holder: x, y_holder: y})
    keras_custom[i] = model_m.evaluate(x, y)[0]
# print(model.get_weights())

plt.plot(xaxis, keras_ce, label='keras')
plt.plot(xaxis, my_ce, 'b',  label='my_ce')
plt.plot(xaxis, tf_ce, 'r:', linewidth=5, label='tensorflow')
plt.plot(xaxis, keras_custom, '--', label='custom loss')
plt.xlabel('a')
plt.ylabel('xentropy')
plt.yscale('log')
plt.legend()
plt.savefig('compare.jpg')
plt.show()

we can see that tensorflow is same with manual computing, but keras with custom loss encounter numeric overflow as expected. enter image description here

like image 44
BugKiller Avatar answered Sep 19 '22 17:09

BugKiller