Is there a way to check for linearly dependent columns in a dataframe?

Is there a way to check for linear dependency for columns in a pandas dataframe? For example:

import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3, 4]})
df['B'] = df['A'] * 2   # B is a scalar multiple of A
df['C'] = [8, 3, 5, 4]  # C is not a combination of A and B
print(df)

   A  B  C
0  0  0  8
1  2  4  3
2  3  6  5
3  4  8  4

Is there a way to show that column B is a linear combination of A, while C is an independent column? My ultimate goal is to run a Poisson regression on a dataset, but I keep getting a LinAlgError: Singular matrix error. As I understand it, this means the matrix built from my dataframe has no inverse, so it must contain linearly dependent columns.

I would like to come up with a programmatic way to check each feature and ensure there are no dependent columns.

Geoff Perrin asked Jun 14 '17 22:06

1 Answer

If you have SymPy, you could compute the "reduced row echelon form" with sympy.Matrix.rref:

>>> import sympy 
>>> reduced_form, inds = sympy.Matrix(df.values).rref()
>>> reduced_form
Matrix([
[1.0, 2.0,   0],
[  0,   0, 1.0],
[  0,   0,   0],
[  0,   0,   0]])

>>> inds
[0, 2]

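The pivot columns (stored as inds) are the positions of the linearly independent columns, so you can simply "slice away" the other ones. Wrapping inds in list() keeps the indexing working on SymPy versions that return the pivots as a tuple instead of a list:

>>> df.iloc[:, list(inds)]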
   A  C
0  0  8
1  2  3
2  3  5
3  4  4
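
If SymPy is not available, or the data is floating point (where exact row reduction can be sensitive to rounding), a rank-based check with NumPy is one possible alternative. The following is only a minimal sketch and not part of the original answer: the helper name split_dependent_columns and its tol parameter are hypothetical, and the approach assumes that scanning columns left to right and keeping a column only when it raises the numerical rank is acceptable for your data.

```python
import numpy as np
import pandas as pd


def split_dependent_columns(frame, tol=None):
    """Return (independent, dependent) column labels, scanning left to right.

    Hypothetical helper: a column is kept only if appending it to the
    columns kept so far increases the numerical rank. `tol` is passed
    straight to numpy.linalg.matrix_rank.
    """
    values = frame.to_numpy(dtype=float)
    kept = []     # positions of columns judged independent so far
    dropped = []  # positions of columns that add no new rank
    for pos in range(values.shape[1]):
        trial = values[:, kept + [pos]]
        if np.linalg.matrix_rank(trial, tol=tol) > len(kept):
            kept.append(pos)
        else:
            dropped.append(pos)
    independent = [frame.columns[i] for i in kept]
    dependent = [frame.columns[i] for i in dropped]
    return independent, dependent


# Same data as in the question: B is exactly 2 * A.
df = pd.DataFrame({'A': [0, 2, 3, 4], 'B': [0, 4, 6, 8], 'C': [8, 3, 5, 4]})
independent, dependent = split_dependent_columns(df)
print(independent, dependent)  # ['A', 'C'] ['B']
```

Which column of a dependent group gets reported depends on the column order: the first one seen is kept and the later ones are flagged. For noisy or nearly collinear features, the tolerance may need tuning, since the default threshold only catches dependencies that are exact up to floating-point error.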
MSeifert answered Sep 21 '22 09:09