My data can be viewed as a matrix of 10B entries (100M x 100), which is very sparse (fewer than 1 in 10,000 entries are non-zero). I would like to feed the data into a Keras neural network model which I have built, using a TensorFlow backend.
My first thought was to expand the data to be dense, that is, to write out all 10B entries into a series of CSVs, with most entries zero. However, this quickly overwhelms my resources (even the ETL overwhelmed pandas and caused postgres to struggle). So I need to use true sparse matrices.
How can I do that with Keras (and TensorFlow)? While numpy doesn't support sparse matrices, scipy and TensorFlow both do. There's lots of discussion of this idea (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg ) - either using scipy's sparse matrices or going directly to TensorFlow's sparse tensors. But I can't find a clear conclusion, and I haven't been able to get anything to work (or even to know clearly which way to go!).
How can I do this?
I believe there are two possible approaches:

1. Keep it as a scipy sparse matrix, then, when giving Keras a minibatch, make it dense
2. Keep it sparse all the way through, and use TensorFlow sparse tensors
I also think #2 is preferred, because you'll get much better performance all the way through (I believe), but #1 is probably easier and will be adequate. I'll be happy with either.
How can either be implemented?
You can pass sparse tensors between Keras layers, and also have Keras models return them as outputs. To accept a sparse tensor as model input, set sparse=True when calling tf.keras.Input (or tf.keras.layers.InputLayer).
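For instance, a minimal sketch in recent TF 2.x (the layer sizes are made up for illustration; older versions may need the input densified before a Dense layer):

import tensorflow as tf

# Declaring the input as sparse lets the model accept tf.sparse.SparseTensor
# inputs (and, I believe, scipy sparse matrices passed to fit, which Keras
# converts on the fly).
inputs = tf.keras.Input(shape=(100,), sparse=True)
hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")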
A sparse tensor is a tensor in which most of the entries are zero; a large diagonal matrix is one example. Rather than storing every value of the tensor, it stores only the non-zero values and their corresponding coordinates.
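Concretely, TensorFlow's tf.sparse.SparseTensor holds exactly those pieces: the coordinates of the non-zeros, their values, and the overall dense shape. A minimal sketch:

import tensorflow as tf

# A 3x4 matrix with only two non-zero entries, stored in coordinate (COO)
# form: where the non-zeros are, what they are, and the full shape.
st = tf.sparse.SparseTensor(indices=[[0, 1], [2, 3]],
                            values=[10.0, 20.0],
                            dense_shape=[3, 4])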
Dense tensors, by contrast, store all values in a contiguous, sequential block of memory, zeros included; that is exactly what makes densifying a matrix like yours (10B entries, almost all zero) so wasteful.
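When you do need to move between the two representations, TF 2.x has direct converters (a small sketch):

import tensorflow as tf

dense = tf.constant([[0.0, 1.0], [2.0, 0.0]])
sparse = tf.sparse.from_dense(dense)  # keeps only the non-zero entries
back = tf.sparse.to_dense(sparse)     # fills the missing entries back with zeros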
tf.where will return the indices of condition that are non-zero, in the form of a 2-D tensor with shape [n, d], where n is the number of non-zero elements in condition (tf.count_nonzero(condition)), and d is the number of axes of condition (tf.rank(condition)).
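That is also the manual recipe for building a SparseTensor from a dense tensor (a sketch; note that in TF 2.x tf.count_nonzero lives at tf.math.count_nonzero):

import tensorflow as tf

dense = tf.constant([[0.0, 1.0], [2.0, 0.0]])
indices = tf.where(tf.not_equal(dense, 0.0))  # [n, d] indices of the non-zeros
values = tf.gather_nd(dense, indices)         # the n non-zero values themselves
sparse = tf.sparse.SparseTensor(indices, values,
                                tf.shape(dense, out_type=tf.int64))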
Sorry, I don't have the reputation to comment, but I think you should take a look at the answer here: Keras, sparse matrix issue. I have tried it and it works correctly. One note though: at least in my case, the shuffling led to really bad results, so I used this slightly modified non-shuffled alternative:
import numpy as np

def nn_batch_generator(X_data, y_data, batch_size):
    # Number of batches needed for one full pass over the data
    number_of_batches = int(np.ceil(X_data.shape[0] / batch_size))
    index = np.arange(np.shape(y_data)[0])
    counter = 0
    while True:
        # Next contiguous (deliberately non-shuffled) slice of row indices
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        # Densify only the current minibatch, never the whole sparse matrix
        X_batch = np.asarray(X_data[index_batch, :].todense())
        y_batch = y_data[index_batch]
        counter += 1
        yield X_batch, y_batch
        if counter >= number_of_batches:
            counter = 0
It produces accuracies comparable to those achieved by Keras's shuffled implementation (setting shuffle=True in fit).
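A minimal usage sketch, assuming X_train is a scipy.sparse matrix, y_train a numpy array, and model an already-compiled Keras model (these names are placeholders, not from the original answer):

import numpy as np

# X_train, y_train, and model are assumed to already exist (placeholders).
batch_size = 32
steps = int(np.ceil(X_train.shape[0] / batch_size))
# Older Keras used model.fit_generator; newer tf.keras accepts a Python
# generator directly in fit, with steps_per_epoch giving the epoch length.
model.fit(nn_batch_generator(X_train, y_train, batch_size),
          steps_per_epoch=steps, epochs=10)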