
cor() behavior in R different between individual vectors and data.frame

I'm trying to get the Pearson correlation coefficient for all rows in a data frame relative to each other. Some values are missing (NA), and this seems to cause a problem that I don't encounter when running cor() on two vectors with missing values. This is the correct result on two vectors:

x <- c(NA, 4.5, NA, 4, NA, 1)
y <- c(2.5, 3.5, 3, 3.5, 3, 2.5)
cor(x,y, use = "complete.obs")
[1] 0.9912407

and here is the result when they are part of a data frame:

cor(t(critics1), use = "complete.obs")
   y  a  b  c  d  e  x
y  1 NA NA NA NA NA NA
a NA  1  1  1 -1  1 -1
b NA  1  1  1 -1  1 -1
c NA  1  1  1 -1  1 -1
d NA -1 -1 -1  1 -1  1
e NA  1  1  1 -1  1 -1
x NA -1 -1 -1  1 -1  1
Warning message:
In cor(t(critics1), use = "complete.obs") : the standard deviation is zero

Why is the use parameter not having the same effect? Here is what the critics1 data frame looks like:

  film1 film2 film3 film4 film5 film6
y   2.5   3.5   3.0   3.5   3.0   2.5
a   3.0   3.5   1.5   5.0   3.0   3.5
b   2.5   3.0    NA   3.5   4.0    NA
c    NA   3.5   3.0   4.0   4.5   2.5
d   3.0   4.0   2.0   3.0   3.0   2.0
e   3.0   4.0    NA   5.0   3.0   3.5
x    NA   4.5    NA   4.0    NA   1.0
asked Dec 06 '11 18:12 by hawkhandler



1 Answer

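As @joran speculated, use="complete.obs" drops every row that contains any NA before computing anything, and when you transpose critics1 only two complete observations (i.e. rows with no missing values) remain: film2 and film4. With only two points, every correlation is exactly 1 or -1. The exception is y, which has value 3.5 in both complete rows, so its standard deviation is zero and every correlation involving it is NA.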

t(critics1)
#         y   a   b   c d   e   x
# film1 2.5 3.0 2.5  NA 3 3.0  NA
# film2 3.5 3.5 3.0 3.5 4 4.0 4.5
# film3 3.0 1.5  NA 3.0 2  NA  NA
# film4 3.5 5.0 3.5 4.0 3 5.0 4.0
# film5 3.0 3.0 4.0 4.5 3 3.0  NA
# film6 2.5 3.5  NA 2.5 2 3.5 1.0
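
To check this yourself, here is a minimal sketch. It assumes critics1 can be rebuilt from the values printed in the question; the names tc and keep are only illustrative.

```r
# Rebuild critics1 from the values shown in the question (assumed to match the original)
critics1 <- data.frame(
  film1 = c(2.5, 3.0, 2.5,  NA, 3.0, 3.0,  NA),
  film2 = c(3.5, 3.5, 3.0, 3.5, 4.0, 4.0, 4.5),
  film3 = c(3.0, 1.5,  NA, 3.0, 2.0,  NA,  NA),
  film4 = c(3.5, 5.0, 3.5, 4.0, 3.0, 5.0, 4.0),
  film5 = c(3.0, 3.0, 4.0, 4.5, 3.0, 3.0,  NA),
  film6 = c(2.5, 3.5,  NA, 2.5, 2.0, 3.5, 1.0),
  row.names = c("y", "a", "b", "c", "d", "e", "x")
)

tc <- t(critics1)            # films become rows, critics become columns
keep <- complete.cases(tc)   # TRUE only for rows without any NA

rownames(tc)[keep]           # the rows that use = "complete.obs" actually keeps
tc[keep, ]                   # y is constant here, so its standard deviation is zero
```

The rows left in tc[keep, ] are the only data cor() sees with use="complete.obs", which explains both the 1/-1 pattern and the warning.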

If you use use="pairwise.complete.obs" instead of use="complete.obs", each pair of columns is correlated using all rows in which both of them are present, so it works as you'd like:

cor(t(critics1), use="pairwise.complete.obs")["y","x"] # Extract correlation of y and x
# [1] 0.9912407
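
As a rough illustration of what the pairwise option does for a single pair, the sketch below filters y and x by hand and compares the result with cor(). It again assumes critics1 matches the values printed in the question; both, r_manual, r_pairwise and n_pairs are illustrative names.

```r
# Rebuild critics1 from the values shown in the question (assumed to match the original)
critics1 <- data.frame(
  film1 = c(2.5, 3.0, 2.5,  NA, 3.0, 3.0,  NA),
  film2 = c(3.5, 3.5, 3.0, 3.5, 4.0, 4.0, 4.5),
  film3 = c(3.0, 1.5,  NA, 3.0, 2.0,  NA,  NA),
  film4 = c(3.5, 5.0, 3.5, 4.0, 3.0, 5.0, 4.0),
  film5 = c(3.0, 3.0, 4.0, 4.5, 3.0, 3.0,  NA),
  film6 = c(2.5, 3.5,  NA, 2.5, 2.0, 3.5, 1.0),
  row.names = c("y", "a", "b", "c", "d", "e", "x")
)
tc <- t(critics1)

# Keep only the films that both y and x have rated, then correlate by hand
both <- !is.na(tc[, "y"]) & !is.na(tc[, "x"])
r_manual <- cor(tc[both, "y"], tc[both, "x"])

# The same pair taken from the pairwise-complete correlation matrix
r_pairwise <- cor(tc, use = "pairwise.complete.obs")["y", "x"]
all.equal(r_manual, r_pairwise)

# Number of films each correlation is based on (differs from pair to pair)
n_pairs <- crossprod((!is.na(tc)) * 1)
n_pairs["y", "x"]
```

Keep in mind that with pairwise deletion each entry of the matrix may rest on a different number of observations (n_pairs shows how many), so some entries are much less reliable than others.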
answered Sep 28 '22 06:09 by Josh O'Brien
