
Score number of factors in common with R

Tags: dataframe, r, apply

I'm dealing with a tricky problem. Suppose I have the following data:

df <- data.frame(matrix(ncol = 0, nrow = 7))
df$x <- factor(c("blue","blue","red","red","green","green","black"))
df$y <- factor(c("A","B","A","C","B","C","A"))
df$z <- c(1998, 1998, 1998, 1998, 1999, 2000, 2001)

We can see that A and B have blue in common, but not red or black. A and C have red in common, but not blue, green, or black. And so on.

I want to generate a matrix that scores the proportion of colors that letters i,j have in common out of the union of all colors they occupy (but not colors unoccupied by either letter). In other words, the diagonals would be the number of colors letter i occupies total, and the off-diagonals would be the share of colors letter i jointly occupies with letter j for all i,j.

I can do it for each pair A,B individually with something like this:

df.A <- df[df$x %in% unique(df$x[df$y=="A"]),] # rows whose color is occupied by A
df.B <- df[df$x %in% unique(df$x[df$y=="B"]),] # rows whose color is occupied by B
length(df.A$y[df.A$y=="B"]) # number of A's colors also occupied by B
length(df.A$y[df.A$y=="B"]) / (length(df.A$y[df.A$y=="A"])) # share of A's colors that B also occupies; i.e. (B|A) / A

In this example, A occupies three colors in total and B occupies two. Of these, A and B have only one in common. Out of everything A occupies (n=3, NOT the entire set of n=4 colors), B overlaps on only one, for a proportion of 0.333.

My actual data have thousands of rows and hundreds of factor levels, so it's impossible to work through all the pairs by hand. But I can't figure out how to write a function that does it, even after much searching. I figure there must be a straightforward solution that I'm overlooking. Please help!

UPDATE: Thanks to @Ian Campbell and @thelatemail, the solution is simply:

t(table(df[,1:2])) %*% table(df[,1:2])

or crossprod(table(df$x, df$y))

To answer the rest of my own question, I can obtain the proportions I want simply by taking:

x <- t(table(df[,1:2])) %*% table(df[,1:2])

x / diag(x)
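
A note on `x / diag(x)`: it relies on R recycling `diag(x)` down the columns, which divides row i by the count for letter i. A slightly more explicit sketch, assuming the same `df` as above, uses `sweep()` for the row-wise division. It also caps the table at 1 first, on the assumption (not true of the example, but possible in larger data) that the same color/letter pair can appear in more than one row. The names `inc`, `shared`, and `prop` are just illustrative.

```r
df <- data.frame(
  x = factor(c("blue", "blue", "red", "red", "green", "green", "black")),
  y = factor(c("A", "B", "A", "C", "B", "C", "A"))
)

# Color-by-letter count matrix, capped at 1 so that a repeated
# (color, letter) pair still counts as a single color.
tab <- table(df$x, df$y)
inc <- (unclass(tab) > 0) * 1

# shared[i, j] = number of colors that letters i and j both occupy.
shared <- crossprod(inc)

# Divide row i by the number of colors letter i occupies.
prop <- sweep(shared, 1, diag(shared), "/")
round(prop, 3)
```

With this layout, `prop["A", "B"]` is the share of A's colors that B also occupies (1 of 3), while `prop["B", "A"]` is the share of B's colors that A also occupies (1 of 2), so the matrix is generally not symmetric.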

asked Nov 22 '25 05:11 by beddotcom


1 Answer

I love it when I can use linear algebra for something.

t(table(df[,1:2])) %*% table(df[,1:2])
   y
y   A B C
  A 3 1 1
  B 1 2 1
  C 1 1 2

Edit: As noted by @thelatemail, there is a built-in (potentially faster) function as well:

crossprod(table(df$x, df$y))

    A B C
  A 3 1 1
  B 1 2 1
  C 1 1 2
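
For anyone wondering why this works: `table(df$x, df$y)` is a color-by-letter matrix of 0/1 indicators, so entry (i, j) of the cross-product adds up, over colors, the product of letter i's and letter j's indicators, which is the number of colors both occupy. A quick sanity check, sketched in base R and assuming each color/letter pair appears at most once (as in the example), compares it against a brute-force count with `intersect()`. The names `cols`, `brute`, and `fast` are made up for illustration.

```r
df <- data.frame(
  x = factor(c("blue", "blue", "red", "red", "green", "green", "black")),
  y = factor(c("A", "B", "A", "C", "B", "C", "A"))
)

# Colors occupied by each letter, as a named list of character vectors.
cols <- split(as.character(df$x), df$y)
lv <- levels(df$y)

# Brute force: count the shared colors for every pair of letters.
brute <- sapply(lv, function(i) {
  sapply(lv, function(j) length(intersect(cols[[i]], cols[[j]])))
})

# Linear-algebra version from the answer.
fast <- crossprod(table(df$x, df$y))

all(brute == fast)
```

The brute-force loop is only there as a check; on thousands of rows and hundreds of levels the `crossprod()` call is the one to use.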

answered Nov 23 '25 21:11 by Ian Campbell


