Building the clusters based on the "partnership"

Question

Some time ago I did similar thread but unfortunately the approach which I've used there didn't give me any promising results. I came with an idea how to do it in a different way. So here I am.

Of course example data:

structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
"Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA, 
12L))

This data frame contains only two columns. The original data has more but this time I will focus only on those columns.

Just to show how exactly I would like to cluster those guys I will put a desired output:

structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
"Duster 360"), cluster = c(1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 3, 3
)), .Names = c("Name1", "Name2", "cluster"), row.names = c(NA, 
12L), class = "data.frame")

As you can see from the ouput I would like to cluster cars depending on the partner which can be find in the second column. So if the cars in one row shares the same "partner" in the next column they should be clustered together.

And how it looks in the table form with a little bit of explanation:

           Name1         Name2 cluster
1      Mazda RX4 Mazda RX4 Wag       1  ## Two Mazda's same cluster
2      Mazda RX4      Merc 230       1  ## First Mazda with another partner
3       KIA Ceed     KIA Sport       2  ## Ceed together with Sport
4  Mazda RX4 Wag      Merc 230       1  ## Second Mazda with the same partner
5       KIA Ceed   KIA Classic       2  ## Ceed together with Classic
6        Valiant      Merc 230       3  
7    KIA Classic     KIA Sport       2  ## And of course Classic with Sport
8      Mazda RX4     Merc 240D       0  ## First Mazda with another Merc but can't be clustered together in the cluster number 1 because the second Mazda doesn't share this "partner".
9          Dacia Mazda RX4 Wag       0  ## Similar situation but just second Mazda
10      Merc 280    Merc 450SE       0
11    Duster 360       Valiant       3
12      Merc 230    Duster 360       3

It's just simple example what I would like to achieve. Of course according to my original data it may happen that some pairs of cars will be members of different clusters. Cluster numbers can be separated by coma or the another column can be created if necessary. It's not obligatory to set 0 for the pairs which cannot be clustered with anything else. They can just form a cluster with single row. I will not analyze it anyway.

I hope that I was able to explain exactly what I would like to achieve. Creative ideas are more than welcome.

Of course I would like to start the bounty for the answer which satisfies me like I did in previous thread.

A. Webb · Accepted Answer

From the comments, "I would like to create clusters which contain at least 3 different genes and all of them interact which each other."

This description would appear to correspond to the definition of a clique in graph theory. That is, you appear to be seeking all cliques of size 3 or larger.

cliques

So with your sample data

library(igraph)
g<-graph.data.frame(data,directed=FALSE)
(q<-cliques(g,min=3))
#> [[1]]
#> + 3/12 vertices, named:
#> [1] Mazda RX4     Mazda RX4 Wag Merc 230     
#> 
#> [[2]]
#> + 3/12 vertices, named:
#> [1] KIA Ceed    KIA Classic KIA Sport  
#> 
#> [[3]]
#> + 3/12 vertices, named:
#> [1] Valiant    Duster 360 Merc 230

You recognize that any edge could belong to more than one clique, so I have created one column per clique with a flag for beloning to that clique.

ind<-t(apply(data,1,function(r) sapply(q,function(i) all(as.character(r) %in% names(i)))))
(d1<-cbind(data,ind))

           Name1         Name2     1     2     3
1      Mazda RX4 Mazda RX4 Wag  TRUE FALSE FALSE
2      Mazda RX4      Merc 230  TRUE FALSE FALSE
3       KIA Ceed     KIA Sport FALSE  TRUE FALSE
4  Mazda RX4 Wag      Merc 230  TRUE FALSE FALSE
5       KIA Ceed   KIA Classic FALSE  TRUE FALSE
6        Valiant      Merc 230 FALSE FALSE  TRUE
7    KIA Classic     KIA Sport FALSE  TRUE FALSE
8      Mazda RX4     Merc 240D FALSE FALSE FALSE
9          Dacia Mazda RX4 Wag FALSE FALSE FALSE
10      Merc 280    Merc 450SE FALSE FALSE FALSE
11    Duster 360       Valiant FALSE FALSE  TRUE
12      Merc 230    Duster 360 FALSE FALSE  TRUE

Or, you could present them in a list in each row of the data.frame.

(d2<-cbind(data,clique=I(as.list(apply(ind,1,which)))))


           Name1         Name2 clique
1      Mazda RX4 Mazda RX4 Wag      1
2      Mazda RX4      Merc 230      1
3       KIA Ceed     KIA Sport      2
4  Mazda RX4 Wag      Merc 230      1
5       KIA Ceed   KIA Classic      2
6        Valiant      Merc 230      3
7    KIA Classic     KIA Sport      2
8      Mazda RX4     Merc 240D       
9          Dacia Mazda RX4 Wag       
10      Merc 280    Merc 450SE       
11    Duster 360       Valiant      3
12      Merc 230    Duster 360      3

Tensibai · Answer

Probably far less efficient, but posting for advice/remarks:

cars <- structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
 "Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
 "Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
 "Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
 "KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
 "Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA, 
 12L))

# Add the cluster number column to df, first row being cluster 1
cars$cluster <- c(1,rep(0,nrow(cars)-1))

# First cluster, we have to start somewhere
clusters <- list(c(paste0(cars$Name1[1]),paste0(cars$Name2[1]))) # Side note, use of paste0 for a readable output in case of factorized dataframe


# Now the ugly part, loop over the df starting at row 2
for (i in 2:nrow(cars)) {
  # Get the cars name in a more easy variable name
  c1 <- paste0(cars$Name1[i])
  c2 <- paste0(cars$Name2[i])
  # boolean to know if a new cluster have to be created
  found <- F

  # Check if first car is referenced somewhere in a cluster
  if (c1 %in% unlist(clusters)) {
    # It is, loop over the cluster list to find in wich
    for (j in 1:length(clusters)) {
      cl <- clusters[[j]] # Same shortcut var for the cluster
      if (c1 %in% cl) { # Find which cluster c1 is part of
        others <- cl[cl != c1] # Now get the other cluster members
        # Now check if the partner exists in df with relation to 1 of the others 
        if ( any( (cars$Name1 %in% others & cars$Name2 == c2)  
                | (cars$Name2 %in% others & cars$Name1 == c2) 

             )
           )
        {
          if (!c2 %in% cl) {
            clusters[[j]] <- append(cl,c2) # Update the cluster with partner car if not already present

          }
          found <- T # Set the boolean
          break # We can stop looping in the cluster list now
        }

      }
    }
  } else if (c2 %in% unlist(clusters)) { # Same as previous block with c1 and c2 swapped
    for (j in 1:length(clusters)) {
      cl <- clusters[[j]]
      if (c2 %in% cl) {
        others <- cl[cl != c2]
        if ( any( (cars$Name1 %in% others & cars$Name2 == c1) 
                | (cars$Name2 %in% others & cars$Name1 == c1) 
             )
           )
        {
          if (!c1 %in% cl) {
            clusters[[j]] <- append(cl,c1)
          }
          found <- T
          break
        }
      }
    }
  }
  # If the pair could be related to a cluster, update the df and got to next row
  if (found == T) { 
    cars$cluster[i] <- j
    next
  }
  # We didn't found a match, just create a new cluster with the pair and udpate the df
  clusters[[length(clusters)+1]] <- c(c1,c2)
  cars$cluster[i] <- length(clusters)
}

Wich output:

> cars
           Name1         Name2 cluster
1      Mazda RX4 Mazda RX4 Wag       1
2      Mazda RX4      Merc 230       1
3       KIA Ceed     KIA Sport       2
4  Mazda RX4 Wag      Merc 230       1
5       KIA Ceed   KIA Classic       2
6        Valiant      Merc 230       3
7    KIA Classic     KIA Sport       2
8      Mazda RX4     Merc 240D       4
9          Dacia Mazda RX4 Wag       5
10      Merc 280    Merc 450SE       6
11    Duster 360       Valiant       3
12      Merc 230    Duster 360       3

Or in cluster list view:

> clusters
[[1]]
[1] "Mazda RX4"     "Mazda RX4 Wag" "Merc 230"     

[[2]]
[1] "KIA Ceed"    "KIA Sport"   "KIA Classic"

[[3]]
[1] "Valiant"    "Merc 230"   "Duster 360"

[[4]]
[1] "Mazda RX4" "Merc 240D"

[[5]]
[1] "Dacia"         "Mazda RX4 Wag"

[[6]]
[1] "Merc 280"   "Merc 450SE"

Building the clusters based on the "partnership"

Tags:

r

Shaxi Liver

2 Answers

A. Webb

Tensibai

Recent Activity

Donate For Us

Building the clusters based on the "partnership"

Tags:

r

Shaxi Liver

2 Answers

A. Webb

Tensibai

Related questions

Recent Activity

Donate For Us