 

Error when using MirroredStrategy in TensorFlow

I read and processed the data with the following code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('Step1_output.csv')
data = data.sample(frac=1).reset_index(drop=True)   # shuffle the rows
data1 = pd.DataFrame(data, columns=['Res_pair'])

# create an instance of LabelEncoder
labelencoder = LabelEncoder()
# assign numerical values and store them in another column
data1['Res_pair_ID'] = labelencoder.fit_transform(data1['Res_pair'])
data['Res_pair'] = data1['Res_pair_ID']
data = data.to_numpy()
train_X = data[:, 0:566]                  # first 566 columns are features
train_y = data[:, 566:]                   # remaining columns are labels
train_X = train_X.reshape((train_X.shape[0], train_X.shape[1], 1))

I build the model with the following code, where I try to distribute training using TensorFlow's MirroredStrategy:

print("Hyper-parameter values:\n")
print('Momentum Rate =',momentum_rate,'\n')
print('learning rate =',learning_rate,'\n')
print('Number of neurons =',neurons,'\n')

  

strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
        model = tf.keras.Sequential([ 
          tf.keras.layers.Conv1D(64,kernel_size = 3,activation='relu',input_shape=train_X.shape[1:]),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(neurons,activation='relu'),
          tf.keras.layers.Dense(neurons,activation='relu'),
          tf.keras.layers.Dense(neurons,activation='relu'),
          tf.keras.layers.Dense(neurons,activation='relu'),
          tf.keras.layers.Dense(10, activation='softmax'),])
        sgd = optimizers.SGD(lr=learning_rate, decay=1e-6, momentum=momentum_rate, nesterov=True)
        model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy',tensorflow.keras.metrics.Precision()])
        results = model.fit(train_X,train_y,validation_split = 0.2,epochs=10,batch_size = 100)
        print(results)
       
    path = 'saved_model/'
    
    model.save(path, save_format='tf')

    for k in range(100):
        momentum_rate = random.random()
        learning_rate = random.uniform(0,0.2)
        neurons = random.randint(10,50)

I tried to run the code on a GPU, but it runs for some time and then throws this error:

Hyper-parameter values:

Momentum Rate = 0.6477407029392913

learning rate = 0.03988890117492503

Number of neurons = 35

Epoch 1/10
     1/270110 [..............................] - ETA: 28s - loss: nan - accuracy: 0.0100 - precision: 0.0100
Traceback (most recent call last):
  File "parallelised_script_realdata2.py", line 56, in <module>
    results = model.fit(train_X,train_y,validation_split = 0.2,epochs=10,batch_size = 100)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator) 
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx) 
  File "/usr/local/lib64/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_4/Softmax:0) = ] [[nan nan nan...]...] [y (Cast_6/x:0) = ] [0]
         [[{{node assert_greater_equal/Assert/AssertGuard/else/_21/assert_greater_equal/Assert/AssertGuard/Assert}}]] [Op:__inference_train_function_1270]

Function call stack:
train_function

Update: The code works well if I don't use strategy = tensorflow.distribute.MirroredStrategy(), as in the code below (but it will fail for larger datasets due to memory shortage):

import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense
from tensorflow.keras import optimizers

def convolutional_neural_network(x, y):
    print("Hyper-parameter values:\n")
    print('Momentum Rate =', momentum_rate, '\n')
    print('learning rate =', learning_rate, '\n')
    print('Number of neurons =', neurons, '\n')

    model = Sequential()
    model.add(Conv1D(filters=64, input_shape=train_X.shape[1:], activation='relu', kernel_size=3))
    model.add(Flatten())
    model.add(Dense(neurons, activation='relu'))  # first hidden layer
    model.add(Dense(neurons, activation='relu'))  # second hidden layer
    model.add(Dense(neurons, activation='relu'))
    model.add(Dense(neurons, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    sgd = optimizers.SGD(lr=learning_rate, decay=1e-6, momentum=momentum_rate, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=sgd,
                  metrics=['accuracy', tensorflow.keras.metrics.Precision()])

    history = model.fit(train_X, train_y, validation_split=0.2, epochs=10, batch_size=100)


momentum_rate = 0.09
learning_rate = 0.01
neurons = 40
print(convolutional_neural_network(train_X, train_y))

Update 2: I am still facing a similar issue with a smaller dataset:

_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d (Conv1D)              (None, 564, 64)           256
_________________________________________________________________
flatten (Flatten)            (None, 36096)             0
_________________________________________________________________
dense (Dense)                (None, 50)                1804850
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550
_________________________________________________________________
dense_2 (Dense)              (None, 50)                2550
_________________________________________________________________
dense_3 (Dense)              (None, 50)                2550
_________________________________________________________________
dense_4 (Dense)              (None, 10)                510
=================================================================
Total params: 1,813,266
Trainable params: 1,813,266
Non-trainable params: 0
asked Nov 30 '20 by shome


1 Answer

The model definition seems fine, and so does the strategy.
Can you verify train_y as a sanity check? I'm fairly sure that's where the error lies.
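
A minimal sketch of that sanity check, assuming train_X and train_y are the arrays built in the question (Dense(10, softmax) with categorical_crossentropy expects 10 one-hot label columns and no NaNs):

import numpy as np

print(train_y.shape)                   # expect (num_samples, 10) to match Dense(10, softmax)
print(np.isnan(train_y).any())         # NaN labels propagate NaN into the loss
print(np.isnan(train_X).any())         # NaN features do the same
print(np.unique(train_y.sum(axis=1)))  # categorical_crossentropy expects each row to sum to 1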

If that's not the case, try calling model.fit (and the steps after it) outside the strategy scope.
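
A rough sketch of that suggestion, mirroring the question's code (TF 2.x as in the traceback; momentum_rate, learning_rate, neurons, train_X and train_y are assumed to be defined already). The model is built and compiled inside strategy.scope(), while fit() and save() run outside it:

import tensorflow as tf
from tensorflow.keras import optimizers

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu',
                               input_shape=train_X.shape[1:]),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(neurons, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    sgd = optimizers.SGD(lr=learning_rate, decay=1e-6,
                         momentum=momentum_rate, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=sgd,
                  metrics=['accuracy', tf.keras.metrics.Precision()])

# fit() and save() do not have to run inside the scope; the strategy stays attached to the model
results = model.fit(train_X, train_y, validation_split=0.2, epochs=10, batch_size=100)
model.save('saved_model/', save_format='tf')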

answered Nov 14 '22 by Himanshu Tanwar