Some time ago I did similar thread but unfortunately the approach which I've used there didn't give me any promising results. I came with an idea how to do it in a different way. So here I am.
Of course example data:
structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed",
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4",
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag",
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230",
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant",
"Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA,
12L))
This data frame contains only two columns. The original data has more but this time I will focus only on those columns.
Just to show how exactly I would like to cluster those guys I will put a desired output:
structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed",
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4",
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag",
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230",
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant",
"Duster 360"), cluster = c(1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 3, 3
)), .Names = c("Name1", "Name2", "cluster"), row.names = c(NA,
12L), class = "data.frame")
As you can see from the ouput I would like to cluster cars depending on the partner which can be find in the second column. So if the cars in one row shares the same "partner" in the next column they should be clustered together.
And how it looks in the table form with a little bit of explanation:
Name1 Name2 cluster
1 Mazda RX4 Mazda RX4 Wag 1 ## Two Mazda's same cluster
2 Mazda RX4 Merc 230 1 ## First Mazda with another partner
3 KIA Ceed KIA Sport 2 ## Ceed together with Sport
4 Mazda RX4 Wag Merc 230 1 ## Second Mazda with the same partner
5 KIA Ceed KIA Classic 2 ## Ceed together with Classic
6 Valiant Merc 230 3
7 KIA Classic KIA Sport 2 ## And of course Classic with Sport
8 Mazda RX4 Merc 240D 0 ## First Mazda with another Merc but can't be clustered together in the cluster number 1 because the second Mazda doesn't share this "partner".
9 Dacia Mazda RX4 Wag 0 ## Similar situation but just second Mazda
10 Merc 280 Merc 450SE 0
11 Duster 360 Valiant 3
12 Merc 230 Duster 360 3
It's just simple example what I would like to achieve. Of course according to my original data it may happen that some pairs of cars will be members of different clusters. Cluster numbers can be separated by coma or the another column can be created if necessary. It's not obligatory to set 0 for the pairs which cannot be clustered with anything else. They can just form a cluster with single row. I will not analyze it anyway.
I hope that I was able to explain exactly what I would like to achieve. Creative ideas are more than welcome.
Of course I would like to start the bounty for the answer which satisfies me like I did in previous thread.
From the comments, "I would like to create clusters which contain at least 3 different genes and all of them interact which each other."
This description would appear to correspond to the definition of a clique in graph theory. That is, you appear to be seeking all cliques of size 3 or larger.
So with your sample data
library(igraph)
g<-graph.data.frame(data,directed=FALSE)
(q<-cliques(g,min=3))
#> [[1]]
#> + 3/12 vertices, named:
#> [1] Mazda RX4 Mazda RX4 Wag Merc 230
#>
#> [[2]]
#> + 3/12 vertices, named:
#> [1] KIA Ceed KIA Classic KIA Sport
#>
#> [[3]]
#> + 3/12 vertices, named:
#> [1] Valiant Duster 360 Merc 230
You recognize that any edge could belong to more than one clique, so I have created one column per clique with a flag for beloning to that clique.
ind<-t(apply(data,1,function(r) sapply(q,function(i) all(as.character(r) %in% names(i)))))
(d1<-cbind(data,ind))
Name1 Name2 1 2 3 1 Mazda RX4 Mazda RX4 Wag TRUE FALSE FALSE 2 Mazda RX4 Merc 230 TRUE FALSE FALSE 3 KIA Ceed KIA Sport FALSE TRUE FALSE 4 Mazda RX4 Wag Merc 230 TRUE FALSE FALSE 5 KIA Ceed KIA Classic FALSE TRUE FALSE 6 Valiant Merc 230 FALSE FALSE TRUE 7 KIA Classic KIA Sport FALSE TRUE FALSE 8 Mazda RX4 Merc 240D FALSE FALSE FALSE 9 Dacia Mazda RX4 Wag FALSE FALSE FALSE 10 Merc 280 Merc 450SE FALSE FALSE FALSE 11 Duster 360 Valiant FALSE FALSE TRUE 12 Merc 230 Duster 360 FALSE FALSE TRUE
Or, you could present them in a list in each row of the data.frame.
(d2<-cbind(data,clique=I(as.list(apply(ind,1,which)))))
Name1 Name2 clique 1 Mazda RX4 Mazda RX4 Wag 1 2 Mazda RX4 Merc 230 1 3 KIA Ceed KIA Sport 2 4 Mazda RX4 Wag Merc 230 1 5 KIA Ceed KIA Classic 2 6 Valiant Merc 230 3 7 KIA Classic KIA Sport 2 8 Mazda RX4 Merc 240D 9 Dacia Mazda RX4 Wag 10 Merc 280 Merc 450SE 11 Duster 360 Valiant 3 12 Merc 230 Duster 360 3
Probably far less efficient, but posting for advice/remarks:
cars <- structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed",
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4",
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag",
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230",
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant",
"Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA,
12L))
# Add the cluster number column to df, first row being cluster 1
cars$cluster <- c(1,rep(0,nrow(cars)-1))
# First cluster, we have to start somewhere
clusters <- list(c(paste0(cars$Name1[1]),paste0(cars$Name2[1]))) # Side note, use of paste0 for a readable output in case of factorized dataframe
# Now the ugly part, loop over the df starting at row 2
for (i in 2:nrow(cars)) {
# Get the cars name in a more easy variable name
c1 <- paste0(cars$Name1[i])
c2 <- paste0(cars$Name2[i])
# boolean to know if a new cluster have to be created
found <- F
# Check if first car is referenced somewhere in a cluster
if (c1 %in% unlist(clusters)) {
# It is, loop over the cluster list to find in wich
for (j in 1:length(clusters)) {
cl <- clusters[[j]] # Same shortcut var for the cluster
if (c1 %in% cl) { # Find which cluster c1 is part of
others <- cl[cl != c1] # Now get the other cluster members
# Now check if the partner exists in df with relation to 1 of the others
if ( any( (cars$Name1 %in% others & cars$Name2 == c2)
| (cars$Name2 %in% others & cars$Name1 == c2)
)
)
{
if (!c2 %in% cl) {
clusters[[j]] <- append(cl,c2) # Update the cluster with partner car if not already present
}
found <- T # Set the boolean
break # We can stop looping in the cluster list now
}
}
}
} else if (c2 %in% unlist(clusters)) { # Same as previous block with c1 and c2 swapped
for (j in 1:length(clusters)) {
cl <- clusters[[j]]
if (c2 %in% cl) {
others <- cl[cl != c2]
if ( any( (cars$Name1 %in% others & cars$Name2 == c1)
| (cars$Name2 %in% others & cars$Name1 == c1)
)
)
{
if (!c1 %in% cl) {
clusters[[j]] <- append(cl,c1)
}
found <- T
break
}
}
}
}
# If the pair could be related to a cluster, update the df and got to next row
if (found == T) {
cars$cluster[i] <- j
next
}
# We didn't found a match, just create a new cluster with the pair and udpate the df
clusters[[length(clusters)+1]] <- c(c1,c2)
cars$cluster[i] <- length(clusters)
}
Wich output:
> cars
Name1 Name2 cluster
1 Mazda RX4 Mazda RX4 Wag 1
2 Mazda RX4 Merc 230 1
3 KIA Ceed KIA Sport 2
4 Mazda RX4 Wag Merc 230 1
5 KIA Ceed KIA Classic 2
6 Valiant Merc 230 3
7 KIA Classic KIA Sport 2
8 Mazda RX4 Merc 240D 4
9 Dacia Mazda RX4 Wag 5
10 Merc 280 Merc 450SE 6
11 Duster 360 Valiant 3
12 Merc 230 Duster 360 3
Or in cluster list view:
> clusters
[[1]]
[1] "Mazda RX4" "Mazda RX4 Wag" "Merc 230"
[[2]]
[1] "KIA Ceed" "KIA Sport" "KIA Classic"
[[3]]
[1] "Valiant" "Merc 230" "Duster 360"
[[4]]
[1] "Mazda RX4" "Merc 240D"
[[5]]
[1] "Dacia" "Mazda RX4 Wag"
[[6]]
[1] "Merc 280" "Merc 450SE"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With