Is there a way to check for linear dependency for columns in a pandas dataframe? For example:
import pandas as pd
columns = ['A','B', 'C']
df = pd.DataFrame(columns=columns)
df.A = [0,2,3,4]
df.B = df.A*2
df.C = [8,3,5,4]
print(df)
   A  B  C
0  0  0  8
1  2  4  3
2  3  6  5
3  4  8  4
Is there a way to show that column B is a linear combination of A, but C is an independent column? My ultimate goal is to run a Poisson regression on a dataset, but I keep getting a LinAlgError: Singular matrix error, meaning that no inverse exists for my dataframe and thus it contains dependent columns.
I would like to come up with a programmatic way to check each feature and ensure there are no dependent columns.
Two vectors are linearly dependent if and only if they are collinear, i.e., one is a scalar multiple of the other. Any set containing the zero vector is linearly dependent. If a subset of {v_1, v_2, ..., v_k} is linearly dependent, then {v_1, v_2, ..., v_k} is linearly dependent as well.
You can find vectors that span the column space of the matrix with the columnspace() method of SymPy's Matrix class. The vectors it returns are linearly independent columns of the original matrix.
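A minimal sketch of the columnspace() approach, assuming the example DataFrame from the question is rebuilt directly from a dict (the variable names M and basis are illustrative, not from the original post). Converting the values to plain Python integers lets SymPy work with exact arithmetic:

```python
import pandas as pd
import sympy

# Same data as the question: B is exactly 2 * A.
df = pd.DataFrame({"A": [0, 2, 3, 4], "B": [0, 4, 6, 8], "C": [8, 3, 5, 4]})

# Convert to plain Python ints so SymPy works with exact numbers.
M = sympy.Matrix(df.to_numpy().tolist())

# columnspace() returns a list of column vectors forming a basis
# of the column space of M.
basis = M.columnspace()

print(len(basis))  # number of linearly independent columns
for vec in basis:
    print(list(vec))
```

For this data the basis has two vectors, matching columns A and C, so B adds nothing to the span.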
Initialize a col variable with column name. Create a user-defined function check() to check if a column exists in the DataFrame. Call check() method with valid column name. Call check() method with invalid column name.
Given a set of vectors, you can determine if they are linearly independent by writing the vectors as the columns of the matrix A, and solving Ax = 0. If there are any non-zero solutions, then the vectors are linearly dependent. If the only solution is x = 0, then they are linearly independent.
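The Ax = 0 test can be sketched with SymPy's nullspace() method, which returns a basis for all solutions of Ax = 0. This is only an illustration on the question's data; the matrix is typed in by hand here and the names A and null_basis are assumptions made for the example:

```python
import sympy

# Columns are A, B, C from the question; B = 2 * A.
A = sympy.Matrix([
    [0, 0, 8],
    [2, 4, 3],
    [3, 6, 5],
    [4, 8, 4],
])

# nullspace() returns a list of basis vectors x with A * x = 0.
# An empty list means x = 0 is the only solution (independent columns).
null_basis = A.nullspace()

if null_basis:
    print("columns are linearly dependent")
    for x in null_basis:
        print(list(x))  # coefficients of one dependency among the columns
else:
    print("columns are linearly independent")
```

Each returned vector gives the coefficients of a dependency: here the single vector corresponds to -2*A + 1*B + 0*C = 0, which identifies B as a multiple of A and leaves C out of the relation.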
If you have SymPy you could use the "reduced row echelon form" via sympy.Matrix.rref:
>>> import sympy
>>> reduced_form, inds = sympy.Matrix(df.values).rref()
>>> reduced_form
Matrix([
[1.0, 2.0, 0],
[ 0, 0, 1.0],
[ 0, 0, 0],
[ 0, 0, 0]])
>>> inds
[0, 2]
The pivot columns (stored as inds) are the positions of the linearly independent columns, so you can simply "slice away" the other ones:
>>> df.iloc[:, list(inds)]
   A  C
0  0  8
1  2  3
2  3  5
3  4  4
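For larger or floating-point data, exact row reduction in SymPy can be slow, so a numeric rank check is a possible alternative. The following is a rough sketch, not part of the original answer: independent_columns is a hypothetical helper name, and it relies on numpy.linalg.matrix_rank, which decides rank with a floating-point tolerance and can therefore misjudge nearly dependent columns.

```python
import numpy as np
import pandas as pd

def independent_columns(frame, tol=None):
    """Hypothetical helper: keep each column only if it raises the rank
    of the columns kept so far (greedy, left to right)."""
    values = frame.to_numpy(dtype=float)
    kept = []
    for j in range(values.shape[1]):
        candidate = kept + [j]
        # Full column rank means column j is not a combination of kept ones.
        if np.linalg.matrix_rank(values[:, candidate], tol=tol) == len(candidate):
            kept = candidate
    return [frame.columns[j] for j in kept]

df = pd.DataFrame({"A": [0, 2, 3, 4], "B": [0, 4, 6, 8], "C": [8, 3, 5, 4]})

kept = independent_columns(df)
dropped = [c for c in df.columns if c not in kept]
print("keep:", kept)
print("drop:", dropped)
```

Because the scan runs left to right, which column of a dependent group gets dropped depends on column order; reorder the DataFrame first if a particular feature should be preferred.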