Why scikit-learn truncatedSVD uses 'randomized' algorithm as default?

Question

I used with truncatedSVD with 30000 by 40000 size of term-document matrix to reducing the dimension to 3000 dimension, when using 'randomized', variance ratio is about 0.5 (n_iter=10) when using 'arpack', variance ratio is about 0.9

Variance ratio of 'randomized' algorithm is lower than one of 'arpack'.

So why scikit-learn truncatedSVD uses 'randomized' algorithm as default?

Matt Krause · Accepted Answer

Speed!

According to the docs, sklearn.decomposition.TruncatedSVD can use a randomized algorithm due to Halko, Martinson, and Tropp (2009). This paper claims that their algorithm is considerably faster.

For a dense matrix, it runs in O(m*n*log(k)) time, whereas the classical algorithm takes O(m*n*k) time, where m and n are the dimensions of the matrix from which you want the kth largest components. The randomized algorithm is also easier to efficiently parallelize and makes fewer passes over the data.

Table 7.1 of the paper (on page 45) shows the performance of a few algorithms as a function of matrix size and # of components, and the randomized algorithm is often an order of magnitude faster.

The accuracy of the output is also claimed to be pretty good (Figure 7.5), though there are some modifications and constants that might affect it and I haven't gone through the sklearn code to see what they did/did not do.

Why scikit-learn truncatedSVD uses 'randomized' algorithm as default?

Tags:

scikit-learn

svd

dimension-reduction

Kyeongpil

1 Answers

Matt Krause

Recent Activity

Donate For Us

Why scikit-learn truncatedSVD uses 'randomized' algorithm as default?

Tags:

scikit-learn

svd

dimension-reduction

Kyeongpil

1 Answers

Matt Krause

Related questions

Recent Activity

Donate For Us