Is there a way to check for linearly dependent columns in a dataframe?

Tags:

python

pandas

dataframe

linear-algebra

Is there a way to check for linear dependence among the columns of a pandas DataFrame? For example:

import pandas as pd

columns = ['A', 'B', 'C']
df = pd.DataFrame(columns=columns)
df.A = [0, 2, 3, 4]
df.B = df.A * 2  # B is exactly twice A
df.C = [8, 3, 5, 4]
print(df)

   A  B  C
0  0  0  8
1  2  4  3
2  3  6  5
3  4  8  4

Is there a way to show that column B is a linear combination of A, but that C is an independent column? My ultimate goal is to run a Poisson regression on a dataset, but I keep getting a LinAlgError: Singular matrix error. I take this to mean that a matrix built from my dataframe has no inverse, and thus that the dataframe contains linearly dependent columns.

I would like to come up with a programmatic way to check each feature and ensure there are no dependent columns.

asked Jun 14 '17 22:06

Geoff Perrin

1 Answer

If you have SymPy installed, you could compute the "reduced row echelon form" via sympy.Matrix.rref:

>>> import sympy 
>>> reduced_form, inds = sympy.Matrix(df.values).rref()
>>> reduced_form
Matrix([
[1.0, 2.0,   0],
[  0,   0, 1.0],
[  0,   0,   0],
[  0,   0,   0]])

>>> inds
[0, 2]
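
As a hedged aside, the question also asks how to show which column is a combination of which. One way to sketch that, assuming the same example data, is SymPy's Matrix.nullspace(): each basis vector lists coefficients of a column combination that sums to zero, so its nonzero entries name the columns involved. The frame is rebuilt here from a dict so the snippet runs on its own, and the names matrix, null_basis and relation are illustrative only.

```python
import pandas as pd
import sympy

# Same data as in the question, built directly so the snippet is self-contained.
df = pd.DataFrame({"A": [0, 2, 3, 4], "B": [0, 4, 6, 8], "C": [8, 3, 5, 4]})

# Convert to plain Python ints so SymPy works with exact arithmetic.
matrix = sympy.Matrix(df.to_numpy().tolist())

# Each basis vector v satisfies matrix * v == 0, i.e. a dependency among columns.
null_basis = matrix.nullspace()

for vec in null_basis:
    # Keep only the columns that actually take part in this dependency.
    relation = {col: coeff for col, coeff in zip(df.columns, vec) if coeff != 0}
    print(relation)
```

For this data the null space has a single basis vector whose only nonzero entries belong to A and B, which is the statement that B is a multiple of A while C is not involved. An empty list from nullspace() would mean that no column is a combination of the others.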

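The pivot columns (stored as inds) are the positions of a set of linearly independent columns, so you can simply "slice away" the other ones. Newer SymPy versions return the pivots as a tuple rather than a list, so converting with list() keeps the indexing unambiguous:

>>> df.iloc[:, list(inds)]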
   A  C
0  0  8
1  2  3
2  3  5
3  4  4
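
If the columns hold floating-point values, exact row reduction can be sensitive to rounding. A minimal sketch of a tolerance-based alternative, assuming NumPy is available, is to keep a column only when it raises the rank of the columns kept so far. This is a swapped-in technique, not the rref approach above, and the helper name independent_columns and its tol parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def independent_columns(frame, tol=None):
    """Return positions of columns that each add rank to those kept before them."""
    values = frame.to_numpy(dtype=float)
    kept = []
    rank = 0
    for j in range(values.shape[1]):
        # Rank of the previously kept columns plus the candidate column j.
        new_rank = np.linalg.matrix_rank(values[:, kept + [j]], tol=tol)
        if new_rank > rank:
            kept.append(j)
            rank = new_rank
    return kept


# Same data as in the question, built directly so the snippet is self-contained.
df = pd.DataFrame({"A": [0, 2, 3, 4], "B": [0, 4, 6, 8], "C": [8, 3, 5, 4]})

kept = independent_columns(df)
print(df.columns[kept].tolist())
print(df.iloc[:, kept])
```

Like the pivot columns, the result depends on column order: the first column of a dependent group is kept and later ones are dropped. The loop runs one rank computation per column, so it is meant for a modest number of features.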

answered Sep 21 '22 09:09

MSeifert
