I have a two-dimensional table of distances in a data.frame in R (imported from CSV):
CP000036 CP001063 CP001368
CP000036 0 a b
CP001063 a 0 c
CP001368 b c 0
I'd like to "flatten" it so that one axis's value is in the first column, the other axis's value is in the second column, and the distance is in the third column:
Genome1 Genome2 Dist
CP000036 CP001063 a
CP000036 CP001368 b
CP001063 CP001368 c
The above is ideal, but it would be completely fine to have repetition such that each cell in the input matrix has its own row:
Genome1 Genome2 Dist
CP000036 CP000036 0
CP000036 CP001063 a
CP000036 CP001368 b
CP001063 CP000036 a
CP001063 CP001063 0
CP001063 CP001368 c
CP001368 CP000036 b
CP001368 CP001063 c
CP001368 CP001368 0
This is an example 3x3 matrix, but my dataset is much larger (about 2000x2000). I would do this in Excel, but I need ~3 million rows for the output, whereas Excel's maximum is ~1 million.
This question is very similar to "How to “flatten” or “collapse” a 2D Excel table into 1D?"
So this is one solution using melt from the package reshape2:
dm <-
data.frame( CP000036 = c( "0", "a", "b" ),
CP001063 = c( "a", "0", "c" ),
CP001368 = c( "b", "c", "0" ),
stringsAsFactors = FALSE,
row.names = c( "CP000036", "CP001063", "CP001368" ) )
# assuming the distance is a metric (symmetric, zero on the diagonal), mask everything on and below the diagonal
dm[ lower.tri( dm, diag = TRUE ) ] <- NA
dm$Genome1 <- rownames( dm )
# finally melt and drop the masked entries (on and below the diagonal) with na.rm = TRUE
library(reshape2)
dm.molten <- melt( dm, na.rm= TRUE, id.vars="Genome1",
value.name="Dist", variable.name="Genome2" )
print( dm.molten )
Genome1 Genome2 Dist
4 CP000036 CP001063 a
7 CP000036 CP001368 b
8 CP001063 CP001368 c
Probably there are more performant solutions but I like this one because it's plain and simple.
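For what it's worth, a base R variant without the reshape2 dependency could look like the sketch below. It assumes the table is held as a matrix (for a data.frame, as.matrix() should give one), and the names m, idx, flat and full are made up for illustration. I haven't benchmarked it against melt, so treat it as one possible approach rather than the faster one.

```r
# Same toy distance table as above, stored as a character matrix.
ids <- c("CP000036", "CP001063", "CP001368")
m <- matrix(c("0", "a", "b",
              "a", "0", "c",
              "b", "c", "0"),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Row/column positions of the cells strictly above the diagonal.
idx <- which(upper.tri(m), arr.ind = TRUE)

# One row per unordered pair of genomes.
flat <- data.frame(Genome1 = rownames(m)[idx[, 1]],
                   Genome2 = colnames(m)[idx[, 2]],
                   Dist    = m[idx],
                   stringsAsFactors = FALSE)
print(flat)

# Variant with one row per cell (diagonal and both triangles), in row-major order.
full <- data.frame(Genome1 = rep(rownames(m), each = ncol(m)),
                   Genome2 = rep(colnames(m), times = nrow(m)),
                   Dist    = as.vector(t(m)),
                   stringsAsFactors = FALSE)
```

Here flat corresponds to the three-row output asked for in the question, and full to the nine-row version where every cell gets its own row.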