I'm trying to get the Pearson correlation coefficient for all rows in a data frame relative to each other. Some values are missing (NA), and this seems to cause a problem that I don't encounter when running cor() on two vectors with missing values. This is the correct result for two vectors:
x <- c(NA, 4.5, NA, 4, NA, 1)
y <- c(2.5, 3.5, 3, 3.5, 3, 2.5)
cor(x,y, use = "complete.obs")
[1] 0.9912407
And here is the result when they are part of a data frame:
cor(t(critics1), use = "complete.obs")
   y  a  b  c  d  e  x
y  1 NA NA NA NA NA NA
a NA  1  1  1 -1  1 -1
b NA  1  1  1 -1  1 -1
c NA  1  1  1 -1  1 -1
d NA -1 -1 -1  1 -1  1
e NA  1  1  1 -1  1 -1
x NA -1 -1 -1  1 -1  1
Warning message:
In cor(t(critics1), use = "complete.obs") : the standard deviation is zero
Why is the use parameter not having the same effect? Here is what the critics1 data frame looks like:
  film1 film2 film3 film4 film5 film6
y   2.5   3.5   3.0   3.5   3.0   2.5
a   3.0   3.5   1.5   5.0   3.0   3.5
b   2.5   3.0    NA   3.5   4.0    NA
c    NA   3.5   3.0   4.0   4.5   2.5
d   3.0   4.0   2.0   3.0   3.0   2.0
e   3.0   4.0    NA   5.0   3.0   3.5
x    NA   4.5    NA   4.0    NA   1.0
An R vector holds multiple values of the same type (logical, integer, double, character, complex or raw), much like an array in C, and can be created with c(). A data frame is a two-dimensional table: each column holds the values of one variable, each row holds one observation across those variables, and different columns may have different types.
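As @joran speculated, when you transpose critics1, there are only two complete observations (i.e. rows with no missing values), namely film2 and film4. With use="complete.obs", every correlation is computed from just those two rows. That's why all of the correlations are either 1 or -1 or, for those involving y (which has value 3.5 in both complete rows and therefore zero standard deviation), NA.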
t(critics1)
#         y   a   b   c d   e   x
# film1 2.5 3.0 2.5  NA 3 3.0  NA
# film2 3.5 3.5 3.0 3.5 4 4.0 4.5
# film3 3.0 1.5  NA 3.0 2  NA  NA
# film4 3.5 5.0 3.5 4.0 3 5.0 4.0
# film5 3.0 3.0 4.0 4.5 3 3.0  NA
# film6 2.5 3.5  NA 2.5 2 3.5 1.0
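If you want to check which rows survive use="complete.obs", complete.cases() will tell you. This is only a rough sketch: it assumes critics1 is a numeric data frame rebuilt by hand from the table in the question, and the names tc and keep are just illustrative.

```r
# Rebuild critics1 from the table in the question (assumption: all columns numeric)
critics1 <- data.frame(
  film1 = c(2.5, 3.0, 2.5, NA, 3.0, 3.0, NA),
  film2 = c(3.5, 3.5, 3.0, 3.5, 4.0, 4.0, 4.5),
  film3 = c(3.0, 1.5, NA, 3.0, 2.0, NA, NA),
  film4 = c(3.5, 5.0, 3.5, 4.0, 3.0, 5.0, 4.0),
  film5 = c(3.0, 3.0, 4.0, 4.5, 3.0, 3.0, NA),
  film6 = c(2.5, 3.5, NA, 2.5, 2.0, 3.5, 1.0),
  row.names = c("y", "a", "b", "c", "d", "e", "x")
)

tc   <- t(critics1)          # films become rows, critics become columns
keep <- complete.cases(tc)   # TRUE only for films with no NA in any column

rownames(tc)[keep]           # the only rows cor() keeps under "complete.obs"
# [1] "film2" "film4"

sd(tc[keep, "y"])            # y is constant on those rows, hence the warning
# [1] 0
```

With only two points per pair, any correlation that can be computed at all has to be exactly 1 or -1.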
If you use use="pairwise.complete.obs" instead of use="complete.obs", each correlation is computed from the rows where both variables of that pair are present, and it works as you'd like:
cor(t(critics1), use="pairwise.complete.obs")["y","x"] # Extract correlation of y and x
# [1] 0.9912407
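One thing to keep in mind with the pairwise option is that each entry of the matrix can be based on a different number of rows. A quick way to see how many rows each pair shares is sketched below; again critics1 is rebuilt by hand from the question's table, and tc, obs and n_shared are illustrative names, not anything from the original post.

```r
# Rebuild critics1 from the table in the question (assumption: all columns numeric)
critics1 <- data.frame(
  film1 = c(2.5, 3.0, 2.5, NA, 3.0, 3.0, NA),
  film2 = c(3.5, 3.5, 3.0, 3.5, 4.0, 4.0, 4.5),
  film3 = c(3.0, 1.5, NA, 3.0, 2.0, NA, NA),
  film4 = c(3.5, 5.0, 3.5, 4.0, 3.0, 5.0, 4.0),
  film5 = c(3.0, 3.0, 4.0, 4.5, 3.0, 3.0, NA),
  film6 = c(2.5, 3.5, NA, 2.5, 2.0, 3.5, 1.0),
  row.names = c("y", "a", "b", "c", "d", "e", "x")
)

tc <- t(critics1)                       # films as rows, critics as columns

# 1 where a rating is present, 0 where it is NA (dimnames are preserved)
obs <- (!is.na(tc)) + 0

# Entry [i, j] counts the films rated by both critic i and critic j
n_shared <- crossprod(obs)
n_shared["y", "x"]
# [1] 3

# The y/x correlation therefore rests on three films only
r_yx <- cor(tc, use = "pairwise.complete.obs")["y", "x"]
```

Here the y/x value matches the two-vector result from the question because both use the same three films (film2, film4 and film6).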