I have a dataset where the output is one of 46226 categories. I also have millions of samples.
But it seems that Keras/TensorFlow require one-hot encodings of the output.
Problem is, np_utils.to_categorical(y_indices, num_classes) causes an out-of-memory error because it then needs an 8000 x 46226 matrix.
My PC has 8 GB of memory. When I execute 'numpy.zeros((8000, 46226))' it works fine, but when I convert my y_indices to one-hot encodings I get the following error:
------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-9-7b9df1cf8cee> in <module>()
----> 1 Y_cat = to_categorical(Y, num_classes=nb_classes)
c:\program files\anaconda3\envs\python35\lib\site-packages\keras\utils\np_utils.py in to_categorical(y, num_classes)
22 num_classes = np.max(y) + 1
23 n = y.shape[0]
---> 24 categorical = np.zeros((n, num_classes))
25 categorical[np.arange(n), y] = 1
26 return categorical
MemoryError:
Is there any way to get Keras to work around this hurdle? I would be happy to add some code if someone would point out how best to do it.
You do not actually need one-hot encoded labels: you can keep the integer labels and use the sparse_categorical_crossentropy loss, which accepts them directly.
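A minimal sketch of what that looks like, assuming a small dense model; the input dimension, layer sizes, optimizer, and the random X/Y arrays are placeholders for your real data and architecture:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

nb_classes = 46226   # number of output categories
input_dim = 300      # placeholder feature size for illustration

# X is the feature matrix, Y is a 1-D array of integer class indices (0 .. nb_classes-1)
X = np.random.rand(8000, input_dim).astype('float32')
Y = np.random.randint(0, nb_classes, size=(8000,))

model = Sequential([
    Dense(512, activation='relu', input_shape=(input_dim,)),
    Dense(nb_classes, activation='softmax'),
])

# sparse_categorical_crossentropy consumes the integer labels directly,
# so the (n_samples, nb_classes) one-hot matrix is never allocated
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(X, Y, batch_size=128, epochs=1)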
This way there should not be an out-of-memory error, since only a single integer per sample is stored. Another alternative is to write a generator (to use with fit_generator) that one-hot encodes the labels on the fly, one batch at a time.
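A rough sketch of such a generator; the batch size and the assumption that X and Y are NumPy arrays supporting fancy indexing are mine, not from the question:

import numpy as np
from keras.utils.np_utils import to_categorical

def one_hot_batch_generator(X, y, batch_size, num_classes):
    """Yield (features, one-hot labels) one batch at a time, so only a
    (batch_size, num_classes) matrix exists in memory at any moment."""
    n = len(y)
    while True:                              # Keras generators must loop forever
        idx = np.random.permutation(n)       # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], to_categorical(y[batch], num_classes=num_classes)

# Usage sketch (X, Y, nb_classes, model as in the previous snippet):
# steps = int(np.ceil(len(Y) / 128))
# model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit_generator(one_hot_batch_generator(X, Y, 128, nb_classes),
#                     steps_per_epoch=steps, epochs=1)

Note that with this approach the model is compiled with the ordinary categorical_crossentropy loss, since the generator already delivers one-hot labels.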