
Counting unique elements when some are synonyms of each other

I am trying to count the number of unique drugs in this list.

my_drugs=c('a', 'b', 'd', 'h', 'q')

I have the following dictionary, which gives me drug synonyms, but it is not set up so that each row defines a distinct drug (the same drug can appear in more than one row):

dictionary <- read.table(header=TRUE, text="
  drug   names
  a    b;c;d;x
  x    b;c;q
  r    h;g;f
  l   m;n
")

So in this case, there are 2 unique drugs in the list: a, b, d and q are all the same drug (a has b and d as direct synonyms and q as an indirect one through x), and h is a second drug. Synonyms of synonyms count as synonyms.

My attempted approach was to first make a dictionary that only had unique drugs on the left side. To do this, I would cycle through dictionary$drug, grep for each drug in dictionary$drug and dictionary$names, take the union of the matches, use it to replace dictionary$names, and then delete the other rows from the dictionary.

bigdf=dictionary

small_df=data.frame("drug"=NA,"names"=NA)

for(i in 1:nrow(bigdf)){
  search_term=sprintf("*%s*",bigdf$drug[i])
  index=grep(search_term,bigdf$names)
  list=bigdf$names[index]
  list=Reduce(union,list)
  list=paste(list, collapse=";")

  if(!list==""){
    new_row=data.frame("drug"=bigdf$drug[index][1],"names"=list)
    small_df=rbind(small_df,new_row)
    #small_df
    bigdf=bigdf[-index,]
    #dim(bigdf)
  }
  else{
    new_row=data.frame("drug"=bigdf$drug[index][1],"names"="alreadycounted")
    small_df=rbind(small_df,new_row)
  }
}

This did not work (some drugs were missing from small_df), and even if it had, I'm not sure how I would have used the new dictionary to count the number of unique drugs in my list.

How can I count the number of unique drugs in my_drugs?

Thank you for your help, and let me know if this needs further clarification.

Data set size: 200 elements in my_drugs, 2000 rows in dictionary; each drug has 10-12 synonyms.

RustlessBroom asked Dec 11 '17 18:12


1 Answer

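library(igraph)

# Turn each dictionary row (the drug plus its synonyms) into all pairs of names
df1 = unique(data.frame(do.call(
    rbind, apply(X = dictionary,
                 MARGIN = 1,
                 FUN = function(x) t(combn(unlist(strsplit(x, ";")), 2, sort))))))
# Build a graph with one vertex per name and one edge per synonym pair
g = graph_from_data_frame(df1)
# Keep only the vertices that appear in my_drugs
g2 = delete_vertices(g, setdiff(V(g)$name, my_drugs))
# Each connected component of what is left is one unique drug
components(g2)$no
#[1] 2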
d.b answered Nov 16 '22 20:11
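
A note on the approach above: it removes every name that is not in my_drugs before counting components, so two listed drugs that are linked only through a name outside my_drugs may end up counted separately, and a drug in my_drugs that never appears in the dictionary is not counted at all. The following is a minimal base R sketch of an alternative that groups all names first and only then looks up my_drugs. The helper find_root and the variables groups, parent and n_unique are illustrative names, not part of the original question or answer, and the code assumes the dictionary layout shown in the question.

```r
my_drugs <- c("a", "b", "d", "h", "q")
dictionary <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
  drug   names
  a    b;c;d;x
  x    b;c;q
  r    h;g;f
  l   m;n
")

# One character vector per dictionary row: the drug followed by its synonyms.
groups <- Map(function(d, n) c(d, strsplit(n, ";", fixed = TRUE)[[1]]),
              dictionary$drug, dictionary$names)

# Every name starts as its own representative. Names from my_drugs are added
# too, so a drug missing from the dictionary still counts as one drug.
all_names <- unique(c(unlist(groups), my_drugs))
parent <- setNames(all_names, all_names)

# Follow parent links until reaching a name that is its own representative.
find_root <- function(x) {
  while (parent[[x]] != x) x <- parent[[x]]
  x
}

# Merge every name in a row into the group of the row's first name.
for (g in groups) {
  root <- find_root(g[1])
  for (name in g[-1]) parent[[find_root(name)]] <- root
}

# Count the distinct groups that the listed drugs fall into.
n_unique <- length(unique(vapply(my_drugs, find_root, character(1))))
n_unique
# [1] 2
```

Because the grouping is done on the whole dictionary before my_drugs is consulted, indirect synonym chains are kept intact regardless of which names are in the list. For the data size mentioned in the question (about 2000 rows with 10-12 synonyms each) this simple version without path compression should be adequate, though that has not been benchmarked here.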