
Removing NA in correlation matrix

I am computing a correlation matrix for a data frame of 4000 variables, and I would like to remove the variables showing a correlation > 0.5, so I am using this command from the {caret} package:

removeme <- findCorrelation(corrMatrix, cutoff = 0.5, verbose = FALSE)

Error in if (mean(x[i, -i]) > mean(x[-j, j])) { : 
missing value where TRUE/FALSE needed

The data I have is highly variable, and I get NA values here and there. To start with, I couldn't find anything on the help page of this command that deals with NA values, so I decided to remove the NA values myself.

Some variables are NA all the way across the data, and some have only a few NA values. I am trying to remove every variable that causes NA values in the correlation matrix, so that I can use the above command. Here's a minimal example of what my data looks like:

df <- structure(list(GK = 1:10, HGF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), HJI = c(2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    HDF = c(5L, 6L, 8L, 9L, 5L, 2L, 4L, 3L, 2L, 1L), KLJG = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), KLJA = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L), KDA = c(10L, 11L, 15L, 18L, 
    11L, 10L, 10L, 15L, 12L, 13L), OIE = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), AFE = c(0L, 0L, 0L, 1L, 0L, 0L, NA, 
    NA, NA, NA)), .Names = c("GK", "HGF", "HJI", "HDF", "KLJG", 
"KLJA", "KDA", "OIE", "AFE"), class = "data.frame", row.names = c(NA, 
-10L))

corrMatrix <- cor(df,use="pairwise.complete.obs")

What would be the best way to get rid of these annoying variables? I have tried many commands but have not found one that removes them reliably. Here is one of my attempts:

removeme <- corrMatrix[,which(as.numeric(rowSums(is.na(corrMatrix))) > 100)] 

The issue with this command is that if there are more than 100 faulty variables (giving NA in the correlation matrix), the normal variables will be removed as well, because the columns of the normal variables will then also have > 100 NA values.

I hope this edit made my question more clear. Cheers.

asked Oct 01 '13 09:10 by Error404



2 Answers

If you simply want to get rid of any column that has one or more NAs, then just do

x<-x[,colSums(is.na(x))==0]
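
As a small illustration of the line above, here is a sketch on a made-up data frame (the column names a, b and c are only placeholders). Adding drop = FALSE is an extra precaution, not part of the original one-liner: it keeps the result a data frame even if only one column survives.

```r
# Toy data: column b contains one NA, columns a and c are complete.
x <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c(7, 8, 9))

# TRUE for every column that has no missing values at all.
keep <- colSums(is.na(x)) == 0

# drop = FALSE preserves the data-frame shape even with a single kept column.
x_clean <- x[, keep, drop = FALSE]

print(names(x_clean))
```

Only a and c remain here. Keep in mind that this removes a column even if it has a single missing value, which may be too aggressive when NAs are scattered across many variables.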

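However, even with missing data, you can usually compute a correlation matrix with no NA values by specifying the use parameter in the function cor. Setting it to either pairwise.complete.obs or complete.obs will result in a correlation matrix with no NAs, as long as every column still has some variation among the observations that are actually used (a constant or all-NA column yields NA regardless).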

complete.obs will drop every row that contains missing data, whereas pairwise.complete.obs will drop only the incomplete pairs when computing each individual correlation. Note that although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive semi-definite correlation matrix, which could be a problem.

> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
     [,1] [,2] [,3]        [,4]        [,5]
[1,]    1   NA   NA          NA          NA
[2,]   NA    1   NA          NA          NA
[3,]   NA   NA    1          NA          NA
[4,]   NA   NA   NA  1.00000000 -0.01925986
[5,]   NA   NA   NA -0.01925986  1.00000000
> cor(x,use="pairwise.complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085  1.00000000  0.01296008  0.02606083 -0.12333765
[3,] -0.18049501  0.01296008  1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247  0.02606083 -0.03218139  1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986  1.00000000
> cor(x,use="complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112  1.00000000  0.01263764  0.02543900 -0.12571570
[3,] -0.17914810  0.01263764  1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970  0.02543900 -0.03866312  1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848  1.00000000
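
To make the caveat about pairwise.complete.obs concrete, here is a deliberately artificial sketch (the vectors x1, x2 and x3 are constructed for illustration and are not from the question). Each pair of variables is observed on a different block of rows, so the three pairwise correlations contradict each other and the resulting matrix has a negative eigenvalue.

```r
# Each pair of variables overlaps on a different block of three rows.
x1 <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
x2 <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
x3 <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
m  <- cbind(x1, x2, x3)

# cor(x1, x2) = 1 and cor(x2, x3) = 1, yet cor(x1, x3) = -1.
r <- cor(m, use = "pairwise.complete.obs")
print(r)

# A valid correlation matrix has no negative eigenvalues; this one does.
ev <- eigen(r, symmetric = TRUE, only.values = TRUE)$values
min_ev <- min(ev)
print(min_ev < 0)
```

Real data rarely behaves this badly, but checking the smallest eigenvalue is a cheap way to see whether a pairwise matrix is safe to pass to methods that assume positive semi-definiteness.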
answered Oct 08 '22 14:10 by mrip


Before computing the correlations between the predictors in your dataset, remove the zero-variance predictors.

Remove the zero-variance predictors (df is assumed to be the name of your dataset):

zv <- apply(df, 2, function(x) length(unique(x)) == 1)

dfr <- df[, !zv]

n <- ncol(dfr)

Calculate the correlation matrix:

correlationMatrix <- cor(dfr[,1:n],use="complete.obs")

Summarize the correlation matrix:

print(correlationMatrix)

Find the attributes that are highly correlated (ideally > 0.7):

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=(0.7),verbose = FALSE)

Print the indexes of the highly correlated attributes:

print(highlyCorrelated)

Important variables (taken from dfr, because the indexes refer to the columns of correlationMatrix; setdiff also handles the case where nothing is flagged):

important_var <- colnames(dfr)[setdiff(seq_len(n), highlyCorrelated)]
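
The steps above can be pulled together into one sketch on data modeled after the question. Two things here are assumptions rather than part of the answer: the helper has_variation is a hypothetical name, and it ignores NAs when counting distinct values, so a column holding only zeros and NAs is also treated as zero-variance. The call to caret::findCorrelation is guarded so the script still runs where caret is not installed.

```r
# Sample data modeled on the question: constant, all-NA and partly-NA columns.
df <- data.frame(
  GK  = 1:10,
  HGF = rep(0L, 10),                                # constant
  HJI = c(2L, rep(0L, 9)),
  HDF = c(5L, 6L, 8L, 9L, 5L, 2L, 4L, 3L, 2L, 1L),
  KDA = c(10L, 11L, 15L, 18L, 11L, 10L, 10L, 15L, 12L, 13L),
  OIE = rep(NA_integer_, 10),                       # all NA
  AFE = c(0L, 0L, 0L, 1L, 0L, 0L, NA, NA, NA, NA)   # partly NA
)

# Hypothetical helper: TRUE if a column has at least two distinct non-NA values.
has_variation <- function(x) length(unique(x[!is.na(x)])) > 1

keep <- vapply(df, has_variation, logical(1))
dfr  <- df[, keep, drop = FALSE]

# Pairwise deletion keeps the partly-NA column AFE in play.
corrMatrix <- cor(dfr, use = "pairwise.complete.obs")
print(anyNA(corrMatrix))

# Only run the caret step if the package is available.
if (requireNamespace("caret", quietly = TRUE)) {
  removeme <- caret::findCorrelation(corrMatrix, cutoff = 0.5, verbose = FALSE)
  selected <- colnames(dfr)[setdiff(seq_along(dfr), removeme)]
  print(selected)
}
```

On this sample the constant columns and the all-NA column are dropped and the remaining matrix has no NAs. With real data a pair of columns can still produce NA if one of them happens to be constant on the rows they share, so it is worth checking anyNA(corrMatrix) before calling findCorrelation.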
answered Oct 08 '22 14:10 by Madhurima Pal