Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looping and clustering

Tags:

algorithm

r

I have to admit this's too hard for me to do it on my own. I have to analyze some data and this step is crucial for me.

Data which I want to analyze:

> dput(tbl_clustering)
structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L, 
13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L, 
27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L, 
21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L, 
5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L, 
24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L, 
12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 
18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130", 
"AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080", 
"AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510", 
"AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930", 
"AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220", 
"AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720", 
"ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L, 
54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L, 
34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L, 
42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L, 
75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L, 
31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L, 
48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L, 
35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L, 
22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150", 
"AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820", 
"AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170", 
"AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830", 
"AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590", 
"AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050", 
"AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840", 
"AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110", 
"AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400", 
"AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770", 
"AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530", 
"AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200", 
"AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120", 
"AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630", 
"AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160", 
"AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380", 
"ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820", 
"ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L, 
5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L, 
5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L, 
5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 
6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 
8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 
6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA, 
-98L))

To explain better what I want to achieve I will paste some rows over here:

        P1        P2 No_Interactions
1  AT3G61540 AT4G30920               8
2  AT3G61540 AT4G30910               5
3  AT1G58080 AT2G45440               5
4  AT1G58080 AT1G09795               9
5  AT1G58080 AT5G52100               7
6  AT5G04710 AT5G60160               6
7  AT4G03520 AT1G75330               5
8  AT3G58730 AT1G20630               5
9  AT3G58730 AT5G02500               5
10 AT3G58730 AT5G35790               5

First of all the new column Cluster has to be created. Next we focus only on two columns P1 and P2. As you can see in first row we have two names AT3G61540 and AT4G30920 and that's our starting point (loop I believe will be necessary). We put the number 1 in Cluster column. Than we take first name AT3G61540 and scan through both columns P1 and P2 if we find this name once again somewhere with other name than in first row we put number 1 as well in Cluster. Next we take second name from first row AT4G30920 and do the same screening through whole data.

The next step will be to analyze next row and do exactly the same things. In that case in the next row we have exactly the same name for P1 that means we don't need to screen it but the second name AT4G30910 is different so would be great to screen with that one. The problem which appears here is that this row should be the cluster 1 as well. The cluster 2 starts with third row because we have completely new pair of names.

I am aware that's not so easy task and probably it has to be done in couple steps.

EDIT: The output I would like to get:

       P1        P2 No_Interactions      Cluster
1  AT3G61540 AT4G30920               8      1
2  AT3G61540 AT4G30910               5      1
3  AT1G58080 AT2G45440               5      2
4  AT1G58080 AT1G09795               9      2
5  AT1G58080 AT5G52100               7      2
6  AT5G04710 AT5G60160               6      3
7  AT5G52100 AT1G75330               5      2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080
8  AT3G58730 AT1G20630               5      5
9  AT3G58730 AT5G02500               5      5
10 AT3G58730 AT3G61540               5      1 ## Cluster 1 because AT3G61540 was found in first row.
like image 665
Shaxi Liver Avatar asked Jul 16 '15 12:07

Shaxi Liver


4 Answers

I corrected my initial answer and propose you a functional programming approach, using Map and recursion to find your clusters:

library(magrittr)

similar = function(u,v) if(length(intersect(u,v))==0) FALSE else TRUE

clusterify = function(df)
{ 
    clusters = df$cluster

    if(!any(clusters==0)) return(df)

    idx = pmatch(0, clusters)
    lst = Map(c, as.character(df[,1]), as.character(df[,2]))
    el  = c(as.character(df[idx, 1]), as.character(df[idx, 2]))

    K = lst %>%
        sapply(similar, v=el) %>%
        add(0)

    mask = if(any(clusters!=0 & K==1))

    if(any(mask))
    {
        cl = min(clusters[mask])
        df[K==1,]$cluster = cl
    }
    else
    {
        df[K==1,]$cluster = max(clusters) + 1
    }

    clusterify(df)
}

You can use it by clusterify(transform(df, cluster=0))

For example, the clustering operates correctly on your example, by taking cluster 9 (you can check other clusters):

subset(clusterify(transform(df, cluster=0)), cluster==9)
#          P1        P2 No_Interactions cluster
#25 AT5G19220 AT5G48300              10       9
#26 AT5G19220 AT2G34590               6       9
#27 AT5G19220 AT5G10920               6       9
#32 AT5G19220 AT3G01500               8       9
#33 AT5G19220 AT2G37020               5       9
#34 AT5G19220 AT2G35040               5       9
#92 AT4G22930 AT5G48300               5       9
#93 AT4G22930 AT2G34590               5       9
#94 AT4G22930 AT5G35630               5       9
#95 AT4G22930 AT4G34200               7       9
#96 AT4G22930 AT1G17745               5       9
#97 AT4G22930 AT4G00570               5       9
#98 AT4G22930 AT2G44350               6       9
like image 108
Colonel Beauvel Avatar answered Oct 31 '22 12:10

Colonel Beauvel


You could use library igraph to make an undirected graph in which you cluster connected composents:

library('igraph')

# make graph and cluster
g = graph.data.frame(tbl_clustering[,c('P1', 'P2')], directed=FALSE)
c = clusters(g)

# append cluster number to original data
tbl_clustering$cluster = sapply(as.vector(tbl_clustering$P1), function(x)c$membership[x])

This assigns clusters to the entries (here the first rows):

> head(tbl_clustering, 8)
         P1        P2 No_Interactions cluster
1 AT3G61540 AT4G30920               8       1
2 AT3G61540 AT4G30910               5       1
3 AT1G58080 AT2G45440               5       2
4 AT1G58080 AT1G09795               9       2
5 AT1G58080 AT5G52100               7       2
6 AT5G04710 AT5G60160               6       3
7 AT4G03520 AT1G75330               5       4
8 AT3G58730 AT1G20630               5       5
like image 26
user1981275 Avatar answered Oct 31 '22 11:10

user1981275


I believe you want to divide your data set into equivalence classes. I have an implementation based on a algorithm in Numerical Recipes. I have included the code below. It can be used as follows:

source("equivalence.R")
ids <- unique(c(levels(data[[1]]), levels(data[[2]])))
classes <- equivalence(ids, data[1:2])
data$class <- classes[match(data$P1, ids)]

equivalence.R

library(Rcpp)
Rcpp::sourceCpp('equivalence.cpp')

equivalence <- function(x, rules) {
  tmp <- unique(x)
  tmp <- tmp[!is.na(tmp)]
  a <- match(rules[[1]], tmp)
  b <- match(rules[[2]], tmp)
  sel <- !is.na(a) & !is.na(b)
  if (any(!sel)) {
    warning("Not all values in rules are present in x.")
    a <- a[sel]
    b <- b[sel]
  }
  res <- c_equivalence(as.integer(a)-1L, as.integer(b)-1L, 
    as.integer(length(tmp)))

  res[match(x, tmp)] + 1L
}

equivalence.cpp

#include <R.h>
#include <Rinternals.h>
#include <string>

extern "C" {
  // [[Rcpp::export]]
  SEXP c_equivalence(SEXP ra, SEXP rb, SEXP rn) {
    try {
      if (LENGTH(ra) != LENGTH(rb)) 
        throw std::string("Lengths of a and be do not match.");
      int* a = INTEGER(ra);
      int* b = INTEGER(rb);
      int  m = LENGTH(ra);
      int  n = INTEGER(rn)[0];
      SEXP classes = PROTECT(allocVector(INTSXP, n)); 
      int* cls = INTEGER(classes);
      //Initialize each element its own class.
      for (int k = 0; k < n; k++) cls[k] = k;
      //For each piece of input information...
      for (int l = 0; l < m; l++) {
        //Track first element up to its ancestor.
        int j = a[l];
        while (cls[j] != j) j = cls[j];
        //Track second element up to its ancestor.
        int k = b[l];
        while (cls[k] != k) k = cls[k];
        //If they are not already related, make them so.
        if (j != k) {
          cls[j] = k;
        }
      }
      //Final sweep up to highest ancestors.
      for (int j = 0; j < n; j++) {
        while (cls[j] != cls[cls[j]]) cls[j] = cls[cls[j]];
      }
      UNPROTECT(1);
      return classes;
    } catch(const std::string& e) {
      error(e.c_str());
      return R_NilValue;
    } catch (...) {
      error("Uncaught exception.");
      return R_NilValue;
    }
  }

}
like image 37
Jan van der Laan Avatar answered Oct 31 '22 11:10

Jan van der Laan


Okay, here is a new answer, which goes some of the way. Again, dat is the data frame.

Cluster <- rep(NA, length(dat[, 1])) #initialise
for(r in 1:length(Cluster)){
    if(identical(as.numeric(r), 1)){Cmatches <- matrix(c(as.character(dat[1, 1]), as.character(dat[1, 2])), 2, 1)}
    matched <- F
    for(cl in 1:length(Cmatches[1,])){
        if(sum(c(as.character(dat[r, 1]), as.character(dat[r, 2])) %in% Cmatches[, cl]) != 0){
            #add P1 and P2 from this row to the cluster which it matches
            Cmatches <- rbind(Cmatches, matrix(c(if(cl != 1){rep("", (cl - 1)*2)}else{character(0)}, as.character(dat[r, 1]), as.character(dat[r, 2]), if(cl != length(Cmatches[1,])){rep("", (length(Cmatches[1, ]) - cl)*2)}else{character(0)}), 2, length(Cmatches[1,]), byrow = F))
            matched <- T
            Cluster[r] <- cl
        }
    }
    if(!matched){
        #add a new cluster, because doesn't match any existing
        Cmatches <- cbind(Cmatches, c(c(as.character(dat[r, 1]), as.character(dat[r, 2])), rep("", length(Cmatches[, 1]) - 2)))
        Cluster[r] <- length(Cmatches[1,])
    }
}

After this, you would take the Cmatch matrix and then check for matches between the clusters using if(sum(match(Cmatch[, cl1], Cmatch[, cl2], incomparables = ""), na.rm = T) != 0) (where cl1 and cl2 are cluster numbers to be compared). If that test was true, then those clusters should be grouped.

Cmatched <- rep(NA, length(Cmatches[1,]))
for(cl in 1:(length(Cmatches[1,]) - 1)){
    for(cl2 in (cl + 1):length(Cmatches[1,])){
        if(sum(match(Cmatches[, cl], Cmatches[, cl2], incomparables = ""), na.rm = T) != 0){
            if(is.na(Cmatched[cl])){
                Cluster[Cluster == cl2] <- cl
                Cmatched[cl2] <- cl
            }else{
                Cluster[Cluster == cl2] <- Cmatched[cl]
                Cmatched[cl2] <- cl
            }
        }
    }
}

And I think that there is your answer. Then just dat <- cbind(dat, Cluster).

like image 1
CJB Avatar answered Oct 31 '22 11:10

CJB