Capturing high multi-collinearity in statsmodels

Say I fit a model in statsmodels

import statsmodels.formula.api as smf

mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()

When I do mod.summary() I may see the following:

Warnings: [1] The condition number is large, 1.59e+05. This might indicate that there are strong multicollinearity or other numerical problems. 

Sometimes the warning is different (e.g. based on the eigenvalues of the design matrix). How can I capture high-multicollinearity conditions in a variable? Is this warning stored somewhere in the model object?
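Ideally I would capture this in a flag right after fitting; a minimal sketch of what I am after (assuming the results object exposes the number from the warning as a condition_number attribute):

high_collinearity = mod.condition_number > 1e3  # assumption: condition_number matches the warning; the cut-off is a judgment call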

Also, where can I find a description of the fields in summary()?

Asked Sep 04 '14 by Amelio Vazquez-Reina

People also ask

How is multicollinearity detected?

A simple method to detect multicollinearity in a model is to compute the variance inflation factor (VIF) for each predictor variable.
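For example, statsmodels ships a VIF helper; a minimal sketch, assuming the fitted results mod from the question (variance_inflation_factor expects the full design matrix, including the constant):

>>> from statsmodels.stats.outliers_influence import variance_inflation_factor
>>> X = mod.model.exog                    # design matrix used in the fit (includes the constant)
>>> vifs = {name: variance_inflation_factor(X, i)
...         for i, name in enumerate(mod.model.exog_names)}

A common rule of thumb treats a VIF above 5-10 as a sign of problematic collinearity (the VIF of the intercept term itself is not meaningful).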

What is multicollinearity, and how do you identify and remove it?

Multicollinearity is a condition in which there is a significant dependency or association between the independent (predictor) variables. A significant correlation between the independent variables is often the first evidence of the presence of multicollinearity, and pairwise correlations are easy to check, as sketched below.
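A quick sketch of that first check, assuming the predictor columns in the question's DataFrame df are numeric:

>>> df[['first_category', 'second_category', 'other']].corr()  # pairwise Pearson correlations

Note that pairwise correlations can miss collinearity involving three or more variables; the eigenvalue approach in the answer below catches those cases as well.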


1 Answer

You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are involved.

If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:

>>> import numpy as np
>>> xs = np.random.randn(100, 5)      # independent variables
>>> corr = np.corrcoef(xs, rowvar=0)  # correlation matrix
>>> w, v = np.linalg.eig(corr)        # eigenvalues & eigenvectors
>>> w
array([ 1.256 ,  1.1937,  0.7273,  0.9516,  0.8714])

However, if, say, x[4] - 2 * x[0] - 3 * x[2] = 0 (up to noise), then

>>> noise = np.random.randn(100)                      # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise  # introduce collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083,  1.9569,  1.1687,  0.8681,  0.9981])

one of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:

>>> v[:,0]
array([-0.4077,  0.0059, -0.5886,  0.0018,  0.6981])

Ignoring the near-zero coefficients, the above basically says that x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies by this eigenvector, the result will hover around zero with small variance:

>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0)  # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)

Note that ys.var() is basically the eigenvalue that was close to zero: the eigenvalues of the correlation matrix are exactly the variances of the standardized data projected onto the corresponding unit eigenvectors.

So, in order to capture high multicollinearity, look at the eigenvalues of the correlation matrix.
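If you would rather stay inside statsmodels than recompute things, the fitted results expose related diagnostics; a minimal sketch, assuming the eigenvals and condition_number properties of the regression results, which appear to be what the summary() warning is derived from (check your statsmodels version):

>>> mod.eigenvals          # eigenvalues of X'X (assumed), sorted in decreasing order
>>> mod.condition_number   # the condition number quoted in the summary() warning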

Answered Oct 18 '22 by behzad.nouri