Is there a way to prevent numpy.linalg.svd running out of memory?

I have 1 million 3D points that I am passing to numpy.linalg.svd, but it runs out of memory very quickly. Is there a way to break this operation down into smaller chunks?

I don't know what it's doing internally, but am I only supposed to pass arrays that represent small square matrices, like 3x3 or 4x4? I ask because I have seen uses of it online where arrays with an arbitrary number of elements were being passed.

asked Nov 11 '22 by Joan Venge

1 Answer

If you have an MxN matrix (in your case 1000000x3), numpy.linalg.svd does not require M == N. In fact, this is precisely where the SVD comes in handy for computing things like rank and pseudo-inverse; methods such as linalg.inv require a square (and full-rank) matrix to have a defined result.
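As a rough illustration, here is a minimal sketch with a small tall matrix showing that the SVD, rank, and pseudo-inverse are all well defined for non-square input (the 8x3 shape and the random data are just assumptions for demonstration):

import numpy as np

# A tall, non-square matrix: 8 observations, 3 features.
a = np.random.randn(8, 3)

U, s, Vh = np.linalg.svd(a, full_matrices=True)
print(U.shape, s.shape, Vh.shape)    # (8, 8) (3,) (3, 3)

# Rank and pseudo-inverse are both defined via the singular values,
# even though a is not square and linalg.inv(a) would raise an error.
print(np.linalg.matrix_rank(a))      # 3 (almost surely, for random data)
pinv = np.linalg.pinv(a)             # Moore-Penrose pseudo-inverse, shape (3, 8)
print(np.allclose(a @ pinv @ a, a))  # True, a defining property of the pseudo-inverse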

@Saullo Castro is right on. Setting full_matrices=False can turn an intractable problem into a tractable one, because instead of the U matrix being 1M x 1M elements, it is only 1M x 3: a huge savings. I am not sure which reduced SVD algorithm numpy uses (I think it might be the compact SVD, or the thin one); a brief description of three widely used variants is in the Reduced SVDs section on Wikipedia: http://en.wikipedia.org/wiki/Singular_value_decomposition. They all center around reducing the computation of the full U matrix to a reduced form, and this is sufficient for some, perhaps even many, problems. The savings are greatest when numberObservations >> numberFeatures.
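To make that concrete for the 1000000x3 shape in the question, here is a rough sketch (assuming float64 data; exact memory use depends on your machine):

import numpy as np

points = np.random.randn(1_000_000, 3)  # 1 million 3D points, float64

# full_matrices=True would try to build a 1,000,000 x 1,000,000 U
# (about 8 TB of float64), which is what exhausts memory.
# The reduced form keeps U at 1,000,000 x 3 instead.
U, s, Vh = np.linalg.svd(points, full_matrices=False)
print(U.shape, s.shape, Vh.shape)  # (1000000, 3) (3,) (3, 3)
print(U.nbytes / 1e6, "MB")        # ~24 MB, versus ~8,000,000 MB for the full U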

Regarding whether you get the same result: the short answer can be yes, depending on exactly what you will be doing with the SVD results. For example, you will get the same matrix back (to within floating-point tolerance) with the reduced form as with the original, as can be seen in the code below. Note that in the top case the size of U is numberObservations x numberObservations, whereas with full_matrices=False the size of U is numberObservations x numberFeatures.

This code was adapted from the numpy.linalg.svd docs so that you can experiment with arbitrary numbers of rows/columns and with how many singular values to keep.

One can always reduce the size of the U matrix to M x min(M, N). Further reductions may be possible depending on the structure of your data and how much noise is present. Just because np.allclose returns False does not mean the computed SVs are bad for all contexts. You can experiment with this using the mostSignificantSingularValues variable, which takes the top SVs from the full SVD.

import numpy as np

numberObservations = 900
numberFeatures = 600
mostSignificantSingularValues = 600

a = np.random.randn(numberObservations, numberFeatures) + 1j*np.random.randn(numberObservations, numberFeatures)

# Reconstruction based on full SVD:
U, s, V = np.linalg.svd(a, full_matrices=True)
print(U.shape, V.shape, s.shape)  # (900, 900) (600, 600) (600,)
S = np.zeros((numberObservations, numberFeatures), dtype=complex)
S[:mostSignificantSingularValues, :mostSignificantSingularValues] = np.diag(s[:mostSignificantSingularValues])
print(np.allclose(a, np.dot(U, np.dot(S, V))))  # True
d1 = a - np.dot(U, np.dot(S, V))

# Reconstruction based on reduced SVD:
U, s, V = np.linalg.svd(a, full_matrices=False)
print(U.shape, V.shape, s.shape)  # (900, 600) (600, 600) (600,)
S = np.diag(s)
print(np.allclose(a, np.dot(U, np.dot(S, V))))  # True
d2 = a - np.dot(U, np.dot(S, V))
answered Nov 15 '22 by Paul