
How can one list pairs of perfectly collinear numeric vectors in a data.frame?



Ideally, something like this:

find_all_perfectly_collinear_pairs( data.frame( A = c( 1, 2, 3), 
                                                B = c( 2, 4, 6), 
                                                C = c( 3, 5, 1 ) ) );

     [,1] [,2]
[1,] "A"  "B" 

indicating that A and B are perfectly collinear (but not B and C or A and C).

All predictors are numeric vectors that contain only integers. I'm looking at about 100 rows and 25 columns.

astletron asked Jan 21 '16 16:01


2 Answers

The caret package has a function, findLinearCombos, that does this. It returns a list with two elements: the column numbers that are linear combinations of one another, and the columns that can be removed to resolve the dependence:

 df = data.frame( A = c( 1, 2, 3), 
                  B = c( 2, 4, 6), 
                  C = c( 3, 5, 1 ))
 caret::findLinearCombos(df)
 ## $linearCombos
 ## $linearCombos[[1]]
 ## [1] 2 1
 ##
 ## $remove
 ## [1] 2
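As a quick sanity check before reaching for caret, a minimal base-R sketch can confirm that some linear dependence exists at all by comparing the matrix rank with the number of columns. The names rank_df and has_dependence are just illustrative, and this only says whether a dependence exists, not which columns are involved:

```r
df <- data.frame(A = c(1, 2, 3),
                 B = c(2, 4, 6),
                 C = c(3, 5, 1))

# Numerical rank from a QR decomposition; a rank below the column
# count means at least one column is a linear combination of others.
rank_df <- qr(as.matrix(df))$rank
has_dependence <- rank_df < ncol(df)
has_dependence
```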


Edited to return the column names, as per the OP's question.

If you want the column names:

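 lincomb = caret::findLinearCombos(df)
 # map the column numbers of the first combination back to names
 colnames(df)[lincomb$linearCombos[[1]]]
 ## [1] "B" "A"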


Additional edit for the case of more than one set of linear combinations. If your original data frame has several sets of linearly dependent columns, you can use lapply over the list of linear combinations returned by findLinearCombos:

 df = data.frame( A = c( 1, 2, 3), 
                  B = c( 2, 4, 6), 
                  C = c( 3, 5, 1 ),
                  D = c( 6, 10, 2))
 lincomb = caret::findLinearCombos(df)
 lapply(lincomb$linearCombos, function(x) colnames(df)[x])
 ## [[1]]
 ## [1] "B" "A"
 ## [[2]]
 ## [1] "D" "C"


Updated to address the OP's comment. If you want to filter out columns to create a new data frame without linear combinations, the other element of the findLinearCombos output, remove, lists the column numbers to drop.
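A minimal sketch of that filtering step, assuming caret is installed. The names drop_cols and df_reduced are just illustrative, and the length check is there because negating an empty index would otherwise select no columns at all:

```r
library(caret)

df <- data.frame(A = c(1, 2, 3),
                 B = c(2, 4, 6),
                 C = c(3, 5, 1),
                 D = c(6, 10, 2))

lincomb <- findLinearCombos(df)

# Column numbers that findLinearCombos suggests removing
drop_cols <- lincomb$remove

# Only subset when there is something to drop; df[, -integer(0)]
# would return a data frame with zero columns.
df_reduced <- if (length(drop_cols) > 0) {
  df[, -drop_cols, drop = FALSE]
} else {
  df
}
df_reduced
```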

jamieRowen answered Nov 07 '22 19:11


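You can use which with arr.ind=TRUE to grab the entries of the correlation matrix that are sufficiently close to 1, and then subset to the entries above the diagonal (row < col) so that each pair is listed once and the diagonal itself is skipped. Here dat is the data frame from the question. Note that this only catches positively correlated pairs; use abs(cor(dat)) if pairs with a correlation of -1 should count as well: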

(positions <- subset(as.data.frame(which(cor(dat) > 0.9999, arr.ind=TRUE)), row < col))
#   row col
# A   1   2

If you wanted to get the names of the variables instead of their column numbers, you can do that conversion:

sapply(positions, function(x) names(dat)[x])
# row col 
# "A" "B"
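Putting these steps together, here is a rough base-R sketch of the function the question asks for. The tol argument and the use of abs() are my own assumptions, not part of the code above: abs() also flags pairs with a correlation of -1, and tol sets how close to 1 counts as perfect. Because it is based on correlation, a pair such as B = 2*A + 5 is flagged too.

```r
# Hypothetical helper, named after the function sketched in the question.
find_all_perfectly_collinear_pairs <- function(dat, tol = 1e-8) {
  cm <- abs(cor(dat))
  # Keep entries above the diagonal so each pair is reported once
  idx <- which(cm > 1 - tol & upper.tri(cm), arr.ind = TRUE)
  # Translate row/col positions into column names, one pair per row
  cbind(colnames(dat)[idx[, "row"]], colnames(dat)[idx[, "col"]])
}

dat <- data.frame(A = c(1, 2, 3),
                  B = c(2, 4, 6),
                  C = c(3, 5, 1))
res <- find_all_perfectly_collinear_pairs(dat)
res  # one row, pairing "A" with "B"
```

If no pair qualifies, the result is a character matrix with zero rows, so nrow(res) can be used to check whether anything was found.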

If you wanted to remove these columns from your data frame before performing linear regression (as you suggest in the comments on your question), then you can simply do:

(dat.smaller <- dat[,-unique(positions$row)])
#   B C
# 1 2 3
# 2 4 5
# 3 6 1

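Note that there's no need to compute the column names in this case; it's more convenient to use the column numbers returned by which with arr.ind=TRUE. One caveat: if no collinear pairs are found, positions$row is empty and the negative index selects no columns at all, so check nrow(positions) > 0 before subsetting.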

josliber answered Nov 07 '22 19:11
