Ideally, something like this:
find_all_perfectly_collinear_pairs( data.frame( A = c( 1, 2, 3),
B = c( 2, 4, 6),
C = c( 3, 5, 1 ) ) );
[,1] [,2]
[1,] "A" "B"
indicating that A and B are perfectly collinear (but not B and C or A and C).
All predictors are numeric vectors that contain only integers. The data has about 100 rows and 25 columns.
The caret package has a function that does this, findLinearCombos. It returns a list with the column numbers that are linear combinations of one another and the columns that can be removed to resolve this:
df = data.frame( A = c( 1, 2, 3),
B = c( 2, 4, 6),
C = c( 3, 5, 1 ))
caret::findLinearCombos(df)
## $linearCombos
## $linearCombos[[1]]
## [1] 2 1
## $remove
## [1] 2
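If you would rather not pull in caret just for this, here is a rough base-R sketch of the same idea. The caret documentation describes findLinearCombos as QR-based; the snippet below is not caret's implementation, only a quick rank check with base qr(), and the variable names q and dependent are my own.

```r
df <- data.frame(A = c(1, 2, 3), B = c(2, 4, 6), C = c(3, 5, 1))

# QR decomposition of the predictors (no intercept column is added here)
q <- qr(as.matrix(df))

q$rank              # number of linearly independent columns
ncol(df) - q$rank   # number of redundant columns

# Base R's default qr() moves columns it judges dependent to the end of
# the pivot order, so the entries past the rank point at them
dependent <- colnames(df)[q$pivot[seq_len(ncol(df)) > q$rank]]
dependent
```

This tells you which columns are redundant, but unlike findLinearCombos it does not tell you which other columns each one depends on.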
Edited to return the column names, as per the OP's question.
If you want the column names:
lincomb = caret::findLinearCombos(df)
colnames(df)[lincomb$linearCombos[[1]]]
## [1] "B" "A"
Additional edit for the case of more than one set of linear combinations: if your original data frame has several sets, you can use lapply over the list of linear combinations returned from findLinearCombos:
df = data.frame( A = c( 1, 2, 3),
B = c( 2, 4, 6),
C = c( 3, 5, 1 ),
D = c( 6, 10, 2))
lincomb = caret::findLinearCombos(df)
lapply(lincomb$linearCombos, function(x) colnames(df)[x])
## [[1]]
## [1] "B" "A"
##
## [[2]]
## [1] "D" "C"
Updated to address the OP's comment. If you want to filter out columns to create a new data frame without linear combinations, the other element of the findLinearCombos output, remove, gives the columns to drop:
df[-lincomb$remove]
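One thing to watch with negative indexing: if there is nothing to remove, the index is empty, and df[-integer(0)] selects zero columns rather than all of them. I am assuming here that lincomb$remove can come back empty (e.g. NULL) when no linear combinations are found, which is worth checking on your caret version. A small guard, with drop_columns being a hypothetical helper name of mine:

```r
# Hypothetical helper: drop columns by index, leaving df unchanged when
# there is nothing to drop (idx is NULL or zero-length)
drop_columns <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[, -idx, drop = FALSE]
}

df <- data.frame(A = c(1, 2, 3), B = c(2, 4, 6), C = c(3, 5, 1))

drop_columns(df, 2)      # drops B, as df[-lincomb$remove] does for this df
drop_columns(df, NULL)   # nothing to drop, so all three columns are kept
```

In practice you would call it as drop_columns(df, lincomb$remove).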
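You can use which with arr.ind=TRUE to grab the entries of the correlation matrix whose absolute value is sufficiently close to 1 (a correlation of -1 is also perfect collinearity), and then subset to the entries above the diagonal (row < col) so that each pair is reported once and the diagonal is dropped:
# dat is the data frame from the question
(positions <- subset(as.data.frame(which(abs(cor(dat)) > 0.9999, arr.ind=TRUE)), row < col))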
# row col
# A 1 2
If you wanted to get the names of the variables instead of their column numbers, you can do that conversion:
sapply(positions, function(x) names(dat)[x])
# row col
# "A" "B"
If you wanted to remove these columns from your data frame before performing linear regression (as you suggest in the comments on your question), then you can simply do:
(dat.smaller <- dat[,-unique(positions$row)])
# B C
# 1 2 3
# 2 4 5
# 3 6 1
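Note that there's no actual need to compute the column names in this case; it is more convenient to use the column numbers returned by which with arr.ind=TRUE. Note also that the negative index only works when at least one pair was found: if positions is empty, dat[,-integer(0)] selects no columns at all.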
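Putting the correlation approach together, here is a minimal sketch of the function asked for in the question. The name comes from the question; the tol argument and its default are my own assumption, so tune it to your data. Keep in mind that |r| = 1 flags any exact relationship of the form B = a + b*A, which is what matters for a regression with an intercept.

```r
# Sketch of the function from the question; `tol` is an assumed tolerance
find_all_perfectly_collinear_pairs <- function(df, tol = 1e-8) {
  cm <- abs(cor(df))          # |r| is 1 for perfectly collinear pairs
  cm[!upper.tri(cm)] <- NA    # keep each pair once (row < col), drop diagonal
  idx <- which(cm > 1 - tol, arr.ind = TRUE)   # NA entries are skipped
  # One row per pair, holding the two column names
  unname(cbind(colnames(df)[idx[, "row"]], colnames(df)[idx[, "col"]]))
}

df <- data.frame(A = c(1, 2, 3), B = c(2, 4, 6), C = c(3, 5, 1))
res <- find_all_perfectly_collinear_pairs(df)
res
```

When no pair is found it returns a matrix with zero rows. Constant columns make cor() return NA (with a warning), so they are never reported as collinear with anything.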