Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute similarity percentage OR Compute correlation between more than 2 objects

Consider I have four objects (a,b,c,d), and I ask five persons to label them (category 1 or 2) according to their physical appearance or something else. The labels provided by five persons for these objects are shown as

df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))

In tabular format,

 ---------
  a b c d
 ---------
  1 1 2 1
  2 2 1 2
  1 2 2 1
  2 1 2 2
  1 1 2 1
----------

Now I want to calculate the percentage of times a group of objects were given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 persons out of 5 persons. So its percentage is 3/5 (=60%). While as objects a and d were given same labels by all the people, so its percentage is 5/5 (=100%)

I can calculate this statistic manually, but in my original dataset, I have 50 such objects and the people are 30 and the labels are 4 (1,2,3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R which can calculate such statistics?

Note: A group can be of any size. In the first example, a group consists of a,b and d while as second example group consists of a and d.

like image 340
Haroon Rashid Avatar asked Nov 21 '16 05:11

Haroon Rashid


4 Answers

If you have numeric ratings, you could use diff to check if you consistently have 0 difference between each rater:

f <- function(cols, data) {
  sum(colSums(diff(t(data[cols]))==0)==(length(cols)-1)) / nrow(data)
}

Results are as expected when applying the function to example groups:

f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1
like image 187
thelatemail Avatar answered Oct 22 '22 10:10

thelatemail


There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.

Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but convert into base if you like.

library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
    flatten() %>%    # eliminate nesting
    set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
    # subset df with combination, see if each row has only one unique value
    map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>% 
    map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd 
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0 
like image 24
alistaire Avatar answered Oct 22 '22 10:10

alistaire


With base R functions you could do:

 groupVec = c("a","b","d")

 transDF = t(as.matrix(DF))

 subDF  = transDF[rownames(transDF) %in% groupVec,]
 subDF
   # [,1] [,2] [,3] [,4] [,5]
 # a    1    2    1    2    1
 # b    1    2    2    1    1
 # d    1    2    1    2    1

 #if length of unique values is 1, it implies match across all objects, count unique values/total columns = match pct
 match_pct = sum(sapply(as.data.frame(subDF), function(x)  sum(length(unique(x))==1) ))/ncol(subDF)
 match_pct
 # [1] 0.6

Wrapping it in a custom funtion:

 fn_matchPercent = function(groupVec =  c("a","d") ) {


 transDF = t(as.matrix(DF))

 subDF  = transDF[rownames(transDF) %in% groupVec,]

 match_pct = sum(sapply(as.data.frame(subDF), function(x)  sum(length(unique(x))==1) ))/ncol(subDF)


 outputDF = data.frame(groups = paste0(groupVec,collapse=",") ,match_pct = match_pct)

 return(outputDF)

 }

 fn_matchPercent(c("a","d"))
   # groups match_pct
 # 1    a,d         1
 fn_matchPercent(c("a","b","d"))
   # groups match_pct
 # 1  a,b,d       0.6
like image 42
Silence Dogood Avatar answered Oct 22 '22 10:10

Silence Dogood


Try this:

find.unanimous.percentage <- function(df, at.a.time) {
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep='')
  cols$percent.unanimous <- 100*colMeans(apply(cols, 1, function(x) apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}

find.unanimous.percentage(df, 2) # take 2 at a time

  O1 O2 percent.unanimous
1  a  b                60
2  a  c                20
3  a  d               100
4  b  c                20
5  b  d                60
6  c  d                20

find.unanimous.percentage(df, 3) # take 3 at a time

  O1 O2 O3 percent.unanimous
1  a  b  c                 0
2  a  b  d                60
3  a  c  d                20
4  b  c  d                 0

find.unanimous.percentage(df, 4)  

  O1 O2 O3 O4 percent.unanimous
1  a  b  c  d                 0
like image 40
Sandipan Dey Avatar answered Oct 22 '22 10:10

Sandipan Dey