In sklearn.decomposition.PCA, why are components_ negative?

Tags:

I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.

When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?

Update: my (partial) answer below contains some additional info.

Take the following example data:

from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# sample data - shape (20, 3), each column standardized to N~(0,1)
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred', 
           start='2017-01-01', end='2017-02-01').pct_change().dropna())

# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
 [-0.43328092 -0.36048659  0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(np.asmatrix(rates), full_matrices=False)
print(Vh)
[[ 0.58365629  0.58614003  0.56194768]
 [ 0.43328092  0.36048659 -0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True  True  True]
 [ True  True  True]
 [False False False]]

474

asked Jun 26 '17 17:06

Brad Solomon

2 Answers

As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is \sum_1^r \s_i u_i v_i^\top : enter image description here

with the s_i ordered in decreasing fashion, then you can see that you can change the sign (i.e., "flip") of say u_1 and v_1, the minus signs will cancel so the formula will still hold.

This shows that the SVD is unique up to a change in sign in pairs of left and right singular vectors.

Since the PCA is just a SVD of X (or an eigenvalue decomposition of X^\top X), there is no guarantee that it does not return different results on the same X every time it is performed. Understandably, scikit learn implementation wants to avoid this: they guarantee that the left and right singular vectors returned (stored in U and V) are always the same, by imposing (which is arbitrary) that the largest coefficient of u_i in absolute value is positive.

As you can see reading the source: first they compute U and V with linalg.svd(). Then, for each vector u_i (i.e, row of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change u_i to - u_i and the corresponding left singular vector, v_i, to - v_i. As told earlier, this does not change the SVD formula since the minus sign cancel out. However, now it is guaranteed that the U and V returned after this processing are always the same, since the indetermination on the sign has been removed.

100

answered Sep 28 '22 00:09

P. Camilleri

After some digging, I've cleared up some, but not all, of my confusion on this. This issue has been covered on stats.stackexchange here. The mathematical answer is that "PCA is a simple mathematical transformation. If you change the signs of the component(s), you do not change the variance that is contained in the first component." However, in this case (with sklearn.PCA), the source of ambiguity is much more specific: in the source (line 391) for PCA you have:

U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)

components_ = V

svd_flip, in turn, is defined here. But why the signs are being flipped to "ensure a deterministic output," I'm not sure. (U, S, V have already been found at this point...). So while sklearn's implementation is not incorrect, I don't think it's all that intuitive. Anyone in finance who is familiar with the concept of a beta (coefficient) will know that the first principal component is most likely something similar to a broad market index. Problem is, the sklearn implementation will get you strong negative loadings to that first principal component.

My solution is a dumbed-down version that does not implement svd_flip. It's pretty barebones in that it doesn't have sklearn parameters such as svd_solver, but does have a number of methods specifically geared towards this purpose.

answered Sep 27 '22 22:09

Brad Solomon

Related questions
                            
                                Can I get a view of a numpy array at specified indexes? (a view from "fancy indexing")
                            
                                Convert HTML string to an image in Python [closed]
                            
                                Library or tool to download multiple files in parallel [closed]
                            
                                How do I apply a DCT to an image in Python?
                            
                                Django's Double Underscore
                            
                                SqlAlchemy won't accept datetime.datetime.now value in a DateTime column
                            
                                How to train a neural network to supervised data set using pybrain black-box optimization?
                            
                                Scipy: lognormal fitting
                            
                                Writing wav file in Python with wavfile.write from SciPy
                            
                                Python windows path slash [duplicate]
                            
                                What is the point of .ix indexing for pandas Series
                            
                                Utility of parameter 'out' in numpy functions
                            
                                Mock side effect only X number of times
                            
                                What to do if I want 3D spline/smooth interpolation of random unstructured data?
                            
                                ImportError: No module named extern
                            
                                What is the best way to remove accents with Apache Spark dataframes in PySpark?
                            
                                How to deal with "divide by zero" with pandas dataframes when manipulating columns? [duplicate]
                            
                                How can I print Hindi sentences(unicode) on image in Python?
                            
                                aiohttp+sqlalchemy: Can't reconnect until invalid transaction is rolled back
                            
                                Why is reading multiple files at the same time slower than reading sequentially?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

In sklearn.decomposition.PCA, why are components_ negative?

Tags:

python

python-3.x

numpy

scikit-learn

pca

Brad Solomon

People also ask

2 Answers

P. Camilleri

Brad Solomon

Recent Activity

Donate For Us