Removing NA in correlation matrix

I am computing a correlation matrix for a data frame of 4000 variables, and I would like to remove the variables showing a correlation > 0.5, so I am using this command from the {caret} package:

removeme <- findCorrelation(corrMatrix, cutoff = 0.5, verbose = FALSE)

Error in if (mean(x[i, -i]) > mean(x[-j, j])) { : 
missing value where TRUE/FALSE needed

The data I have is highly variable, and I get NA values here and there. I couldn't find anything on the help page of this command for dealing with NA values, so I decided to remove the NA values myself.
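The error message itself can be reproduced without {caret}. The sketch below is not the package's code; it only assumes that a comparison of the same kind as the one shown in the error is evaluated on a correlation matrix containing NA (the 2x2 matrix `corr` is a made-up example).

```r
# A toy correlation matrix with NA off the diagonal (made-up values)
corr <- matrix(c(1, NA, NA, 1), nrow = 2)

# mean() of NA is NA, NA > NA is NA, and if() cannot branch on NA
res <- tryCatch(
  if (mean(corr[1, -1]) > mean(corr[-2, 2])) "first" else "second",
  error = function(e) e
)

# The captured condition carries the message R reported
print(conditionMessage(res))
```

So any NA that reaches such a comparison is enough to stop the function, which is why the NA entries have to be dealt with before the call.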

Some variables show NA values all the way across the data, and some show only a few NA values. I am trying to remove the variables that cause any NA values, so that I can use the above command. Here's a minimal example of what my data looks like:

df <- structure(list(GK = 1:10, HGF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), HJI = c(2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    HDF = c(5L, 6L, 8L, 9L, 5L, 2L, 4L, 3L, 2L, 1L), KLJG = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), KLJA = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L), KDA = c(10L, 11L, 15L, 18L, 
    11L, 10L, 10L, 15L, 12L, 13L), OIE = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), AFE = c(0L, 0L, 0L, 1L, 0L, 0L, NA, 
    NA, NA, NA)), .Names = c("GK", "HGF", "HJI", "HDF", "KLJG", 
"KLJA", "KDA", "OIE", "AFE"), class = "data.frame", row.names = c(NA, 
-10L))

corrMatrix <- cor(df,use="pairwise.complete.obs")

What would be the best way to get rid of these annoying variables? I have tried many commands but have not found one that reliably removes them. Here is one of my attempts:

removeme <- corrMatrix[,which(as.numeric(rowSums(is.na(corrMatrix))) > 100)]

The issue with this command is that if there are more than 100 faulty variables (giving NA in the correlation matrix), the normal variables will be removed as well, because the columns of the normal variables will then have > 100 NA values too.
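One way to avoid a fixed threshold such as 100 is to drop the variable with the most NA correlations and repeat until no NA is left. The following is only a sketch of that idea in base R: `drop_na_vars` is a hypothetical helper name, and the data frame is rebuilt from the example above so that the block runs on its own.

```r
# Data shaped like the example in the question
df <- data.frame(
  GK   = 1:10,
  HGF  = rep(0L, 10),
  HJI  = c(2L, rep(0L, 9)),
  HDF  = c(5L, 6L, 8L, 9L, 5L, 2L, 4L, 3L, 2L, 1L),
  KLJG = rep(0L, 10),
  KLJA = rep(0L, 10),
  KDA  = c(10L, 11L, 15L, 18L, 11L, 10L, 10L, 15L, 12L, 13L),
  OIE  = rep(NA_integer_, 10),
  AFE  = c(0L, 0L, 0L, 1L, 0L, 0L, NA, NA, NA, NA)
)

# cor() warns about zero standard deviations; the NAs are handled below
corrMatrix <- suppressWarnings(cor(df, use = "pairwise.complete.obs"))

# Hypothetical helper: repeatedly remove the variable with the most NA entries
drop_na_vars <- function(m) {
  while (anyNA(m) && ncol(m) > 1) {
    worst <- which.max(rowSums(is.na(m)))
    m <- m[-worst, -worst, drop = FALSE]
  }
  m
}

clean <- drop_na_vars(corrMatrix)
removed <- setdiff(colnames(corrMatrix), colnames(clean))
print(removed)
```

Because the count is recomputed after every removal, variables that only had NA correlations with the faulty ones survive, however many faulty variables there are. The resulting `clean` matrix can then be passed to findCorrelation.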

I hope this edit made my question clearer. Cheers.

asked Oct 01 '13 09:10

Error404

2 Answers

If you simply want to get rid of any column that has one or more NAs, then just do

x<-x[,colSums(is.na(x))==0]

However, even with missing data, you can usually compute a correlation matrix with no NA values by specifying the use parameter of the function cor. Setting it to either pairwise.complete.obs or complete.obs will result in a correlation matrix with no NAs, as long as every pair of columns has enough overlapping observations with nonzero variance.

complete.obs will ignore all rows with missing data, whereas pairwise.complete.obs will just ignore the missing pairs of data. Note that although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive semi-definite correlation matrix, which could be a problem.

> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
     [,1] [,2] [,3]        [,4]        [,5]
[1,]    1   NA   NA          NA          NA
[2,]   NA    1   NA          NA          NA
[3,]   NA   NA    1          NA          NA
[4,]   NA   NA   NA  1.00000000 -0.01925986
[5,]   NA   NA   NA -0.01925986  1.00000000
> cor(x,use="pairwise.complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085  1.00000000  0.01296008  0.02606083 -0.12333765
[3,] -0.18049501  0.01296008  1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247  0.02606083 -0.03218139  1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986  1.00000000
> cor(x,use="complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112  1.00000000  0.01263764  0.02543900 -0.12571570
[3,] -0.17914810  0.01263764  1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970  0.02543900 -0.03866312  1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848  1.00000000
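The caveat about pairwise.complete.obs can be made concrete with a small constructed example. The three vectors below are made up so that each pair overlaps on a different set of rows; this is a sketch of the failure mode, not something that happens with typical data.

```r
# Each pair of variables shares only three observations
x <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
y <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
z <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)

# x~y and x~z are perfectly positive, y~z is perfectly negative
R <- cor(cbind(x, y, z), use = "pairwise.complete.obs")
print(R)

# A valid correlation matrix has no negative eigenvalues
ev <- eigen(R, symmetric = TRUE)$values
print(min(ev))
```

The matrix contains no NA, yet its smallest eigenvalue is negative, so no complete data set could have produced it. Methods that rely on the matrix being a proper correlation matrix may fail or give odd results on such input.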

answered Oct 08 '22 14:10

mrip

Before computing the correlations between the predictors in your dataset, remove the zero-variance predictors.

To remove zero-variance predictors:

zv <- apply(df, 2, function(x) length(unique(x)) == 1)

dfr <- df[, !zv]  # suppose df is the name of your dataset

n=length(colnames(dfr))

calculate correlation matrix

correlationMatrix <- cor(dfr[,1:n],use="complete.obs")

summarize the correlation matrix

print(correlationMatrix)

find attributes that are highly correlated (ideally >0.7)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=(0.7),verbose = FALSE)

print indexes of highly correlated attributes

print(highlyCorrelated)

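important variables (index dfr, not df, because the indexes refer to the columns of dfr; setdiff also works when nothing is flagged)

important_var <- setdiff(colnames(dfr), colnames(dfr)[highlyCorrelated])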
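One caveat with the zero-variance test above is that `unique()` counts NA as a value, so a column such as 0, 0, NA, 0 is not flagged even though it is constant once the NA is ignored. The sketch below shows an NA-aware variant on made-up data; `is_constant` and `toy` are hypothetical names, and the {caret} call is skipped if the package is not installed.

```r
# Hypothetical helper: TRUE when a column has at most one distinct non-missing value
is_constant <- function(x) length(unique(x[!is.na(x)])) <= 1

toy <- data.frame(
  a     = c(1, 2, 3, 4, 5, 6),
  b     = c(2, 4, 6, 8, 10, 13),  # nearly collinear with a
  const = c(0, 0, NA, 0, 0, 0),   # constant once the NA is ignored
  d     = c(3, 1, 4, 1, 5, 9)
)

zv <- vapply(toy, is_constant, logical(1))
kept <- toy[, !zv, drop = FALSE]
corr <- cor(kept, use = "pairwise.complete.obs")

if (requireNamespace("caret", quietly = TRUE)) {
  drop_idx <- caret::findCorrelation(corr, cutoff = 0.7)
  # setdiff() keeps every column when drop_idx is empty
  selected <- setdiff(colnames(kept), colnames(kept)[drop_idx])
  print(selected)
}
```

With `drop = FALSE` the result stays a data frame even if only one column survives the filter.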

answered Oct 08 '22 14:10

Madhurima Pal
