I am computing a correlation matrix for my data, which looks like this:
df <- structure(list(V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15
), V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200), V3 = c(NA,
NA, 24, 51, 53, 231, NA, 153, 6, 700), V4 = c(2, 10, NA, 20,
56, 1, 1, 53, 40, 5000)), .Names = c("V1", "V2", "V3", "V4"), row.names = c(NA,
10L), class = "data.frame")
This gives the following data frame:
    V1  V2  V3   V4
1   56  21  NA    2
2  123 231  NA   10
3  546   5  24   NA
4   26   5  51   20
5   62  32  53   56
6    6  NA 231    1
7   NA   1  NA    1
8   NA 231 153   53
9   NA   5   6   40
10  15 200 700 5000
I normally compute the correlation matrix with use="complete.obs", like this:
crm <- cor(df, use="complete.obs", method="pearson")
My question is: how does complete.obs treat the data? Does it drop every row that contains an NA, build an NA-free table, and compute the whole correlation matrix from that table, like this?
df2 <- structure(list(V1 = c(26, 62, 15), V2 = c(5, 32, 200), V3 = c(51,
53, 700), V4 = c(20, 56, 5000)), .Names = c("V1", "V2", "V3",
"V4"), row.names = c(NA, 3L), class = "data.frame")
Or does it omit NA values in a pairwise fashion? For example, when calculating the correlation between V1 and V2, are rows that contain an NA only in V3 (such as rows 1 and 2 in my example) omitted too?
If they are omitted, I am looking for a command that preserves as much of the data as possible by omitting NA values pairwise.
Many thanks,
"complete.obs": correlations will be computed from complete observations, with an error being raised if there are no complete cases. "na.or.complete": correlations will be computed from complete observations, returning an NA if there are no complete cases.
The cor() function will calculate the correlation between two vectors, or will create a correlation matrix when given a matrix.
complete.obs means that rows that contain a missing value (NA) are ignored. An error is returned if every row contains at least one missing value. everything means that values for all pairs of columns are computed, but a missing value (NA) is returned for any pair of columns in which at least one value is missing.
The rcorr() function in the Hmisc package produces correlations and significance levels for Pearson and Spearman correlations. However, the input must be a matrix, and pairwise deletion is used.
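As a rough sketch of what the rcorr() route might look like on the data frame from the question (this assumes the Hmisc package is installed; the guard around the call and the variable name res are only for illustration):

```r
# Data frame from the question, rebuilt with data.frame() for readability
df <- data.frame(
  V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15),
  V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200),
  V3 = c(NA, NA, 24, 51, 53, 231, NA, 153, 6, 700),
  V4 = c(2, 10, NA, 20, 56, 1, 1, 53, 40, 5000)
)

# Only run if Hmisc is available (assumption: it may not be installed)
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() expects a matrix, so convert the data frame first
  res <- Hmisc::rcorr(as.matrix(df), type = "pearson")
  print(res$r)  # correlations, computed with pairwise deletion
  print(res$n)  # number of observations used for each pair
  print(res$P)  # p-values for each pair
}
```

The n component is useful here because, with pairwise deletion, each correlation may rest on a different number of rows.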
Look at the help file for cor, i.e. ?cor. In particular,
If ‘use’ is ‘"everything"’, ‘NA’s will propagate conceptually, i.e., a resulting value will be ‘NA’ whenever one of its contributing observations is ‘NA’.
If ‘use’ is ‘"all.obs"’, then the presence of missing observations will produce an error. If ‘use’ is ‘"complete.obs"’ then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
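The behaviours described in the help text can be checked on a few toy vectors. This is only a minimal sketch; the vectors x, y, a and b and the result names are made up for illustration, and the errors are caught with tryCatch() so the script keeps running:

```r
# One missing value in x
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 9)

# "everything": the NA propagates, so the result is NA
everything_res <- cor(x, y, use = "everything")

# "all.obs": any missing value is an error; keep the message instead of stopping
all_obs_res <- tryCatch(cor(x, y, use = "all.obs"),
                        error = function(e) conditionMessage(e))

# "complete.obs": the pair containing the NA (element 3) is dropped
complete_res <- cor(x, y, use = "complete.obs")

# Two vectors with no complete pairs at all
a <- c(1, NA, 3, NA)
b <- c(NA, 2, NA, 4)

# "complete.obs" raises an error when nothing is left...
no_cases_res <- tryCatch(cor(a, b, use = "complete.obs"),
                         error = function(e) conditionMessage(e))

# ...whereas "na.or.complete" returns NA instead
na_or_complete_res <- cor(a, b, use = "na.or.complete")

print(everything_res)
print(all_obs_res)
print(complete_res)
print(no_cases_res)
print(na_or_complete_res)
```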
To get a better feel for what is going on, create an (even) simpler example:
df1 = df[1:5,1:3]
cor(df1, use="pairwise.complete.obs", method="pearson")
cor(df1, use="complete.obs", method="pearson")
cor(df1[3:5,], method="pearson")
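So, when we use complete.obs, we discard the entire row if an NA is present anywhere in it. In my example, this means we discard rows 1 and 2, which is why the second and third commands give the same result. However, pairwise.complete.obs uses all rows in which both V1 and V2 are non-NA when calculating the correlation between V1 and V2, so rows 1 and 2 are kept for that pair.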
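The same check can be run on the full data frame from the question. The sketch below rebuilds df with data.frame(), and the helper names (casewise, by_hand, pairwise, both_present, n_pairs) are only illustrative:

```r
# Data frame from the question
df <- data.frame(
  V1 = c(56, 123, 546, 26, 62, 6, NA, NA, NA, 15),
  V2 = c(21, 231, 5, 5, 32, NA, 1, 231, 5, 200),
  V3 = c(NA, NA, 24, 51, 53, 231, NA, 153, 6, 700),
  V4 = c(2, 10, NA, 20, 56, 1, 1, 53, 40, 5000)
)

# Casewise deletion: same as dropping every row with an NA first
casewise <- cor(df, use = "complete.obs", method = "pearson")
by_hand  <- cor(df[complete.cases(df), ], method = "pearson")
print(all.equal(casewise, by_hand))

# Pairwise deletion: each entry uses the rows where both columns are present
pairwise <- cor(df, use = "pairwise.complete.obs", method = "pearson")
both_present <- !is.na(df$V1) & !is.na(df$V2)
print(all.equal(pairwise["V1", "V2"],
                cor(df$V1[both_present], df$V2[both_present])))

# Number of rows available for each pair of columns under pairwise deletion
n_pairs <- crossprod(!is.na(as.matrix(df)) * 1)
print(sum(complete.cases(df)))  # rows used by "complete.obs"
print(n_pairs)                  # rows used per pair by "pairwise.complete.obs"
```

Comparing the two counts shows how much data each option keeps. One trade-off to keep in mind is that a pairwise matrix is built from a different subset of rows for each entry.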