I am computing a correlation matrix for a data frame of 4,000 variables, and I would like to remove the variables showing > 0.5 correlation, so I am using this command from the {caret} package:
removeme <- findCorrelation(corrMatrix, cutoff = 0.5, verbose = FALSE)
This fails with:
Error in if (mean(x[i, -i]) > mean(x[-j, j])) { :
missing value where TRUE/FALSE needed
The data I have is highly variable, and I get NA values here and there. I couldn't find anything on this function's help page about dealing with NA values, so I decided to remove them myself.
Some variables are NA all the way across the data, and some show just a few NA values. I am trying to remove the variables that cause any NA values, so that I can use the above command. Here's a minimal example of what my data looks like:
df <- structure(list(GK = 1:10, HGF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), HJI = c(2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
HDF = c(5L, 6L, 8L, 9L, 5L, 2L, 4L, 3L, 2L, 1L), KLJG = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), KLJA = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), KDA = c(10L, 11L, 15L, 18L,
11L, 10L, 10L, 15L, 12L, 13L), OIE = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), AFE = c(0L, 0L, 0L, 1L, 0L, 0L, NA,
NA, NA, NA)), .Names = c("GK", "HGF", "HJI", "HDF", "KLJG",
"KLJA", "KDA", "OIE", "AFE"), class = "data.frame", row.names = c(NA,
-10L))
corrMatrix <- cor(df,use="pairwise.complete.obs")
What would be the best way to get rid of these annoying variables? I have tried many commands but haven't found one that works. Here is one of my attempts:
removeme <- corrMatrix[,which(as.numeric(rowSums(is.na(corrMatrix))) > 100)]
The issue with this command is that if there are more than 100 faulty variables (producing NA in the correlation matrix), the normal variables get removed too, because the columns of the normal variables will also contain > 100 NA values.
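To illustrate with the minimal example above: every row of corrMatrix contains NAs, because the all-NA column (OIE) and the constant columns (HGF, KLJG, KLJA) correlate as NA with everything, so even the normal variables end up with NA-filled rows.

# count the NAs in each row of the correlation matrix
rowSums(is.na(corrMatrix))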
I hope this edit made my question more clear. Cheers.
One may be tempted to remove the rows of the data frame that have one or more missing values; however, you may be left with too few rows to run any meaningful correlation analysis. We would, of course, prefer to get the most from our data. Therefore, we would like to ignore NAs pair by pair in our correlation calculations.
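As a small illustration of pairwise NA handling (hypothetical vectors, not the question's data):

# plain cor() propagates the NAs; casewise/pairwise deletion ignores them
x <- c(1, 2, NA, 4, 5)
y <- c(2, NA, 3, 5, 4)
cor(x, y)                        # NA
cor(x, y, use = "complete.obs")  # uses only the 3 complete pairs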
If you simply want to get rid of any column that has one or more NAs, then just do

x <- x[, colSums(is.na(x)) == 0]

However, even with missing data, you can compute a correlation matrix with no NA values by specifying the use parameter in the function cor. Setting it to either pairwise.complete.obs or complete.obs will result in a correlation matrix with no NAs.

complete.obs will ignore all rows with missing data, whereas pairwise.complete.obs will just ignore the missing pairs of data. Note that, although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive-definite correlation matrix, which could be a problem.
> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA NA NA NA
[2,] NA 1 NA NA NA
[3,] NA NA 1 NA NA
[4,] NA NA NA 1.00000000 -0.01925986
[5,] NA NA NA -0.01925986 1.00000000
> cor(x,use="pairwise.complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085 1.00000000 0.01296008 0.02606083 -0.12333765
[3,] -0.18049501 0.01296008 1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247 0.02606083 -0.03218139 1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986 1.00000000
> cor(x,use="complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112 1.00000000 0.01263764 0.02543900 -0.12571570
[3,] -0.17914810 0.01263764 1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970 0.02543900 -0.03866312 1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848 1.00000000
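To tie this back to the question: a column that is entirely NA (like OIE) or constant still yields NA entries even with use = "pairwise.complete.obs", so a reasonable approach (a sketch, assuming the df from the question) is to drop such columns from the data before computing the matrix:

library(caret)

# keep only columns with at least two non-NA values and some variance
ok <- vapply(df, function(v) {
  v <- v[!is.na(v)]
  length(v) > 1 && length(unique(v)) > 1
}, logical(1))
df2 <- df[, ok]

corrMatrix <- cor(df2, use = "pairwise.complete.obs")
removeme <- findCorrelation(corrMatrix, cutoff = 0.5, verbose = FALSE)

This isn't guaranteed to remove every source of NAs (a pair of columns could still be constant within their complete cases), but it handles the all-NA and zero-variance columns in the example.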
Before evaluating the correlations between the predictors of your dataset, remove the zero-variance predictors (suppose df is the name of your dataset):

# drop columns that contain only a single unique value
zv <- apply(df, 2, function(x) length(unique(x)) == 1)
dfr <- df[, !zv]

correlationMatrix <- cor(dfr, use = "complete.obs")
print(correlationMatrix)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.7, verbose = FALSE)
print(highlyCorrelated)

# note: index dfr (not df), since highlyCorrelated refers to columns of dfr
important_var <- colnames(dfr[, -highlyCorrelated])
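One caveat: if findCorrelation finds nothing above the cutoff, it returns an empty index vector, and negative indexing with an empty vector then selects zero columns rather than all of them. A small guard avoids that:

# keep everything when no column exceeds the cutoff
important_var <- if (length(highlyCorrelated) > 0) {
  colnames(dfr)[-highlyCorrelated]
} else {
  colnames(dfr)
}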