I have a huge matrix with a lot of missing values. I want to get the correlation between variables.
1. Is the solution
cor(na.omit(matrix))
better than below?
cor(matrix, use = "pairwise.complete.obs")
I have already selected only the variables with more than 20% missing values.
2. Which method makes the most sense?
The correlation coefficient is easy to estimate with the familiar product-moment estimator. It is also straightforward to construct confidence intervals using the variance stabilizing Fisher transformation. If some data are missing, it is not possible to assess the correlation in the usual way.
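As a rough illustration of the Fisher-transformation interval mentioned above, here is a minimal sketch in base R. The helper name `fisher_ci` and the simulated vectors `x` and `y` are made up for this example, and the function simply drops incomplete pairs, which is an assumption and not a recommendation for handling missing data.

```r
# Sketch: confidence interval for a Pearson correlation via Fisher's z.
# `fisher_ci`, `x`, and `y` are hypothetical names used only for illustration.
fisher_ci <- function(x, y, conf_level = 0.95) {
  ok <- complete.cases(x, y)          # assumption: keep only jointly observed pairs
  n <- sum(ok)
  r <- cor(x[ok], y[ok])              # product-moment estimate
  z <- atanh(r)                       # Fisher transformation
  se <- 1 / sqrt(n - 3)               # approximate standard error on the z scale
  crit <- qnorm(1 - (1 - conf_level) / 2)
  ci <- tanh(z + c(-1, 1) * crit * se)  # back-transform to the correlation scale
  list(estimate = r, n = n, conf_int = ci)
}

set.seed(1)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)
res <- fisher_ci(x, y)
res$conf_int
```

For complete data this should agree with the interval reported by `cor.test(x, y)`, which uses the same transformation for Pearson correlations.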
I would vote for the second option. It sounds like you have a fair amount of missing data, so you should look for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
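To see what the two options from the question actually do, here is a small base R sketch on simulated data. The matrix `m`, its column names, and the 15% missingness rate are all assumptions chosen for illustration, not taken from the original data.

```r
# Sketch: listwise vs. pairwise deletion on a hypothetical matrix `m`.
set.seed(42)
m <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
m[sample(length(m), 120)] <- NA       # assumption: 15% of cells missing at random

# Option 1: drop every row that has any NA, then correlate.
listwise <- cor(na.omit(m))
# Equivalent call using cor()'s own argument.
complete <- cor(m, use = "complete.obs")

# Option 2: for each pair of columns, use all rows where both are observed.
pairwise <- cor(m, use = "pairwise.complete.obs")

# How many rows each approach actually uses.
n_listwise <- nrow(na.omit(m))
obs <- ifelse(is.na(m), 0, 1)         # 1 where a value is observed
n_pairwise <- crossprod(obs)          # jointly observed rows per column pair

n_listwise
n_pairwise
```

Pairwise deletion never uses fewer rows per pair than listwise deletion, which is the main argument for it when many rows have at least one gap. One caveat is that each entry is then based on a different subset of rows, so the resulting matrix is not guaranteed to be positive semi-definite.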