Keras mixture of models



Is it possible to implement an MLP mixture-of-experts model in Keras? Could you please guide me with a simple Keras example for a binary problem with 2 experts?

It needs a custom cost function like this:

g = gate.layers[-1].output
o1 = mlp1.layers[-1].output
o2 = mlp2.layers[-1].output

def ME_objective(y_true, y_pred):
    A = g[0] * T.exp(-0.5*T.sqr(y_true - o1))
    B = g[1] * T.exp(-0.5*T.sqr(y_true - o2))
    return -T.log((A+B).sum())  # cost

You can definitely model such a structure in Keras with a merge layer, which lets you combine different inputs. Here is an SSCCE that you'll hopefully be able to adapt to your structure:

import numpy as np
from keras.engine import Merge  # Merge is part of the Keras 1.x API
from keras.models import Sequential
from keras.layers import Dense
import keras.backend as K

xdim = 4
ydim = 1
gate = Sequential([Dense(2, input_dim=xdim)])
mlp1 = Sequential([Dense(1, input_dim=xdim)])
mlp2 = Sequential([Dense(1, input_dim=xdim)])

def merge_mode(branches):
    g, o1, o2 = branches
    # I'd have liked to write
    # return o1 * K.transpose(g[:, 0]) + o2 * K.transpose(g[:, 1])
    # but it doesn't work, and I don't know enough Keras to solve it
    return K.transpose(K.transpose(o1) * g[:, 0] + K.transpose(o2) * g[:, 1])

model = Sequential()
model.add(Merge([gate, mlp1, mlp2], output_shape=(ydim,), mode=merge_mode))
model.compile(optimizer='Adam', loss='mean_squared_error')

train_size = 19
nb_inputs = 3  # one input tensor for each branch (g, o1, o2)
x_train = [np.random.random((train_size, xdim)) for _ in range(nb_inputs)]
y_train = np.random.random((train_size, ydim))
model.fit(x_train, y_train)
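
The commented-out expression in merge_mode most likely fails because of shapes: g[:, 0] is a 1-D tensor of shape (batch,), while o1 has shape (batch, 1), so the product does not line up row by row. A minimal NumPy sketch of the same arithmetic (NumPy stands in for the Keras backend here, and the array values are made up) suggests that the double-transpose trick matches a slice that keeps the column axis:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 5
g = rng.random((batch, 2))    # gate outputs, one weight per expert
o1 = rng.random((batch, 1))   # expert 1 predictions
o2 = rng.random((batch, 1))   # expert 2 predictions

# Same arithmetic as merge_mode above: transpose, weight, transpose back.
mixed_transpose = (o1.T * g[:, 0] + o2.T * g[:, 1]).T

# Equivalent form: slicing with 0:1 keeps a (batch, 1) column.
mixed_slice = o1 * g[:, 0:1] + o2 * g[:, 1:2]

# The naive form broadcasts (batch, 1) against (batch,) into (batch, batch).
naive = o1 * g[:, 0] + o2 * g[:, 1]

print(mixed_transpose.shape, mixed_slice.shape, naive.shape)
```

If the same slicing behaves this way on backend tensors, merge_mode could presumably return o1 * g[:, 0:1] + o2 * g[:, 1:2] directly; that has not been checked against Keras here.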

Custom Objective

Here is an implementation of the objective you described. There are a few mathematical concerns to keep in mind though (see below).

def me_loss(y_true, y_pred):
    g = gate.layers[-1].output
    o1 = mlp1.layers[-1].output
    o2 = mlp2.layers[-1].output
    A = g[:, 0] * K.transpose(K.exp(-0.5 * K.square(y_true - o1)))
    B = g[:, 1] * K.transpose(K.exp(-0.5 * K.square(y_true - o2)))
    return -K.log(K.sum(A+B))

# [...] edit the compile line from the example above
model.compile(optimizer='Adam', loss=me_loss)

Some Math

Short version: somewhere in your model, I think there should be at least one constraint (maybe two):

For any x, sum(g(x)) = 1

For any x, g0(x) > 0 and g1(x) > 0 # might not be strictly necessary
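
One common way to satisfy both constraints at once is to pass the gate's raw outputs through a softmax. This is an assumption, not something the code above does; in Keras it would presumably mean giving the gate's Dense layer activation='softmax'. A small NumPy sketch, with a hypothetical softmax helper and made-up raw outputs:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (hypothetical helper, not part of the code above)."""
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Made-up raw gate outputs for 3 samples and 2 experts.
logits = np.array([[2.0, -1.0],
                   [0.0, 0.0],
                   [-30.0, 5.0]])
g = softmax(logits)

print(g.sum(axis=1))   # row sums of the gate weights
print((g > 0).all())   # whether every weight is strictly positive
```

With this kind of normalization, sum(g(x)) = 1 holds by construction, so the optimizer can no longer change the sum of the gate outputs.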

Domain study

  1. If o1(x) and o2(x) are infinitely far from y:

    • the exp term tends toward +0
    • A -> ±0 and B -> ±0, depending on the signs of g0(x) and g1(x)
    • cost -> +infinity or nan
  2. If o1(x) and o2(x) are infinitely close to y:

    • the exp term tends toward 1
    • A -> g0(x) and B -> g1(x)
    • cost -> -log(sum(g(x)))
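
Both cases can be checked numerically. The sketch below re-implements the objective in NumPy for a single sample; the function name me_cost and all values are illustrative assumptions, not part of the Keras code above:

```python
import numpy as np

def me_cost(y, o1, o2, g):
    """NumPy version of the objective above: -log(sum(A + B))."""
    A = g[:, 0] * np.exp(-0.5 * np.square(y - o1))
    B = g[:, 1] * np.exp(-0.5 * np.square(y - o2))
    return -np.log(np.sum(A + B))

y = np.array([1.0])
g = np.array([[0.25, 0.75]])   # made-up gate weights with sum(g) = 1

# Case 1: both experts far from y, so the exp terms underflow to 0.
with np.errstate(divide="ignore"):
    far = me_cost(y, y + 100.0, y - 100.0, g)

# Case 2: both experts equal to y, so the exp terms are exactly 1.
close = me_cost(y, y, y, g)

print(far)     # cost when both experts are far from y
print(close)   # cost when both experts match y
```

In case 2 the result depends only on the gate weights, which is what makes the constraint on sum(g(x)) matter.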

The problem is that log is only defined on ]0, +inf[, which means that for the objective to always be defined, there needs to be a constraint somewhere ensuring sum(A(x) + B(x)) > 0 for any x. A more restrictive version of that constraint would be (g0(x) > 0 and g1(x) > 0).


An even more important concern here is that this objective does not seem to be designed to converge towards 0. When mlp1 and mlp2 start predicting y correctly (case 2.), there is currently nothing preventing the optimizer from making sum(g(x)) tend towards +infinity, which makes the loss tend towards -infinity.

Ideally, we'd like loss -> 0, i.e. sum(g(x)) -> 1
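
To illustrate the concern, the sketch below evaluates the cost in case 2 (both experts exactly right) while scaling unconstrained gate outputs by a growing factor. The helper name and the values are made up. Since the cost reduces to -log(sum(g(x))), it keeps decreasing as the gate grows, whereas a gate normalized to sum to 1 keeps it at 0:

```python
import numpy as np

def perfect_expert_cost(g):
    """Cost in case 2, where exp(...) = 1 and the cost is -log(sum(g))."""
    return -np.log(np.sum(g))

g = np.array([0.25, 0.75])  # made-up gate outputs that sum to 1
scales = (1.0, 10.0, 1e3, 1e6)

# Unconstrained gate: scaling g up drives the cost toward -infinity.
costs = [perfect_expert_cost(s * g) for s in scales]
print(costs)

# Normalized gate: dividing by the sum keeps the cost at 0.
normalized = [perfect_expert_cost(s * g / np.sum(s * g)) for s in scales]
print(normalized)
```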

