Compute similarity percentage OR Compute correlation between more than 2 objects

Question

Consider I have four objects (a,b,c,d), and I ask five persons to label them (category 1 or 2) according to their physical appearance or something else. The labels provided by five persons for these objects are shown as

df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))

In tabular format,

 ---------
  a b c d
 ---------
  1 1 2 1
  2 2 1 2
  1 2 2 1
  2 1 2 2
  1 1 2 1
----------

Now I want to calculate the percentage of times a group of objects were given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 persons out of 5 persons. So its percentage is 3/5 (=60%). While as objects a and d were given same labels by all the people, so its percentage is 5/5 (=100%)

I can calculate this statistic manually, but in my original dataset, I have 50 such objects and the people are 30 and the labels are 4 (1,2,3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R which can calculate such statistics?

Note: A group can be of any size. In the first example, a group consists of a,b and d while as second example group consists of a and d.

thelatemail · Accepted Answer

If you have numeric ratings, you could use diff to check if you consistently have 0 difference between each rater:

f <- function(cols, data) {
  sum(colSums(diff(t(data[cols]))==0)==(length(cols)-1)) / nrow(data)
}

Results are as expected when applying the function to example groups:

f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1

alistaire · Answer

There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.

Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but convert into base if you like.

library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
    flatten() %>%    # eliminate nesting
    set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
    # subset df with combination, see if each row has only one unique value
    map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>% 
    map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd 
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0

Silence Dogood · Answer

With base R functions you could do:

 groupVec = c("a","b","d")

 transDF = t(as.matrix(DF))

 subDF  = transDF[rownames(transDF) %in% groupVec,]
 subDF
   # [,1] [,2] [,3] [,4] [,5]
 # a    1    2    1    2    1
 # b    1    2    2    1    1
 # d    1    2    1    2    1

 #if length of unique values is 1, it implies match across all objects, count unique values/total columns = match pct
 match_pct = sum(sapply(as.data.frame(subDF), function(x)  sum(length(unique(x))==1) ))/ncol(subDF)
 match_pct
 # [1] 0.6

Wrapping it in a custom funtion:

 fn_matchPercent = function(groupVec =  c("a","d") ) {


 transDF = t(as.matrix(DF))

 subDF  = transDF[rownames(transDF) %in% groupVec,]

 match_pct = sum(sapply(as.data.frame(subDF), function(x)  sum(length(unique(x))==1) ))/ncol(subDF)


 outputDF = data.frame(groups = paste0(groupVec,collapse=",") ,match_pct = match_pct)

 return(outputDF)

 }

 fn_matchPercent(c("a","d"))
   # groups match_pct
 # 1    a,d         1
 fn_matchPercent(c("a","b","d"))
   # groups match_pct
 # 1  a,b,d       0.6

Sandipan Dey · Answer

Try this:

find.unanimous.percentage <- function(df, at.a.time) {
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep='')
  cols$percent.unanimous <- 100*colMeans(apply(cols, 1, function(x) apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}

find.unanimous.percentage(df, 2) # take 2 at a time

  O1 O2 percent.unanimous
1  a  b                60
2  a  c                20
3  a  d               100
4  b  c                20
5  b  d                60
6  c  d                20

find.unanimous.percentage(df, 3) # take 3 at a time

  O1 O2 O3 percent.unanimous
1  a  b  c                 0
2  a  b  d                60
3  a  c  d                20
4  b  c  d                 0

find.unanimous.percentage(df, 4)  

  O1 O2 O3 O4 percent.unanimous
1  a  b  c  d                 0

Compute similarity percentage OR Compute correlation between more than 2 objects

Tags:

r

machine-learning

statistics

Haroon Rashid

4 Answers

thelatemail

alistaire

Silence Dogood

Sandipan Dey

Recent Activity

Donate For Us

Compute similarity percentage OR Compute correlation between more than 2 objects

Tags:

r

machine-learning

statistics

Haroon Rashid

4 Answers

thelatemail

alistaire

Silence Dogood

Sandipan Dey

Related questions

Recent Activity

Donate For Us