I'm sure there's a simple solution to this, but I can't figure it out. Suppose I have a data frame that contains the following information:
aaa<-c("A,B","B,C","B,D,E")
vvv<-c("101","101,102","102,103,104")
data_h<-data.frame(aaa,vvv)
data_h
    aaa         vvv
1   A,B         101
2   B,C     101,102
3 B,D,E 102,103,104
The desired output is a frequency table of individual hits, for subsequent analysis in a heat map:
  101 102 103 104
A   1   0   0   0
B   2   2   1   1
C   1   1   0   0
D   0   1   1   1
E   0   1   1   1
How do I make this transformation? I've seen many similar examples, but none where the contents of the data-frame need to be parsed.
The goal is to ultimately use heatmap or something similar on the output table to visualize the correlation between "aaa" and "vvv".
Here is a base R solution in 4 lines of code. First we define a function, spl, which splits the components of a comma-separated string, producing a vector of all the fields. eg takes two string arguments, applies spl to each of them, and then creates a grid of the results of the splitting. Finally we apply eg to each row of data_h, rbind the results together, and tabulate them with xtabs:
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))
xtabs(data = dd)
The result is:
   vvv
aaa 101 102 103 104
  A   1   0   0   0
  B   2   2   1   1
  C   1   1   0   0
  D   0   1   1   1
  E   0   1   1   1
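To make the two helpers more concrete, here is a small sketch of what they return for a single cell and a single row. The names fields and pairs are only illustrative and are not part of the solution above.

```r
# Same helpers as in the solution above
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))

# Splitting one cell gives a character vector of its fields
fields <- spl("B,D,E")
print(fields)

# One row of data_h expands to every (aaa, vvv) pair in that row:
# 2 letters x 2 numbers = 4 rows
pairs <- eg("B,C", "101,102")
print(pairs)
```

The three rows of data_h therefore contribute 2 + 4 + 9 = 15 pairs in total, which is what xtabs counts.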
dcast. Alternatively, replace the xtabs line in the code above with:
library(reshape2)
dcast(dd, aaa ~ vvv, fun.aggregate = length, value.var = "vvv")
in which case the result is:
  aaa 101 102 103 104
1   A   1   0   0   0
2   B   2   2   1   1
3   C   1   1   0   0
4   D   0   1   1   1
5   E   0   1   1   1
tapply. Another alternative would be tapply (however, it will fill in empty cells with NA rather than 0):
tapply(1:nrow(dd), dd, length)
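If the NA cells are a problem for the heat map, one possible follow-up is to overwrite them with 0 and pass the matrix to base R's heatmap. This is only a sketch: the name m and the heatmap arguments (no clustering, no scaling) are my own choices, not something the question specifies.

```r
# Rebuild dd as in the solution above
aaa <- c("A,B", "B,C", "B,D,E")
vvv <- c("101", "101,102", "102,103,104")
data_h <- data.frame(aaa, vvv)
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))

# tapply leaves NA where a combination never occurs; overwrite with 0
m <- tapply(seq_len(nrow(dd)), dd, length)
m[is.na(m)] <- 0L

# R >= 3.4.0 only: let tapply fill the empty cells directly
# m <- tapply(seq_len(nrow(dd)), dd, length, default = 0L)

# m is an ordinary matrix, so heatmap() accepts it as-is.
# Guarded so the sketch does not open a plot device in batch runs.
if (interactive()) heatmap(m, Rowv = NA, Colv = NA, scale = "none")
```

Dropping Rowv = NA and Colv = NA would let heatmap reorder rows and columns by hierarchical clustering, which may or may not be wanted here.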
The shape of the data frame suggests using the splitstackshape package. I don't know this package very well, so I only use it to reshape the data and then compute the frequencies by hand using table:
library(splitstackshape)
data_h_split <- concat.split.multiple(data_h,1:2)
#   aaa_1 aaa_2 aaa_3 vvv_1 vvv_2 vvv_3
# 1     A     B  <NA>   101    NA    NA
# 2     B     C  <NA>   101   102    NA
# 3     B     D     E   102   103   104
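Once the data is in this format (no commas, regular columns), it is easy to compute frequencies using table (tapply or reshape would also work). Note that this pairs the entries by position (aaa_1 with vvv_1, and so on) rather than crossing every letter with every number in a row, so the counts differ from the desired output: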
table(cbind.data.frame(ff= unlist(data_h_split[1:3]),
xx= unlist(data_h_split[4:6])))
   xx
ff  101 102 103 104
  A   1   0   0   0
  B   1   1   0   0
  C   0   1   0   0
  D   0   0   1   0
      0   0   0   0
  E   0   0   0   1
Here's a multi-step approach that gets the desired result using "splitstackshape".
library(splitstackshape)
## Split the "vvv" column first, and reshape at the same time
x <- concat.split.multiple(data_h, split.cols="vvv", ",", "long")
## Add an ID column
x$id <- 1:nrow(x)
## Split the "aaa" column next, again reshaping as we do so
x <- concat.split.multiple(x[complete.cases(x), ], split.cols="aaa", ",", "long")
## Use `table` with `droplevels`
with(droplevels(x), table(aaa, vvv))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1
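As far as I know, later releases of splitstackshape steer users towards cSplit and keep concat.split.multiple mainly for backwards compatibility. Under that assumption, the same two-step idea might look like the sketch below. It has not been checked against every package version, and the name tab is only illustrative.

```r
aaa <- c("A,B", "B,C", "B,D,E")
vvv <- c("101", "101,102", "102,103,104")
data_h <- data.frame(aaa, vvv, stringsAsFactors = FALSE)

# Assumes splitstackshape is installed; the block is skipped otherwise
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  # Split one column at a time into long form, so that every value of
  # "vvv" ends up crossed with every value of "aaa" from the same row
  x <- splitstackshape::cSplit(data_h, "vvv", sep = ",", direction = "long")
  x <- splitstackshape::cSplit(x, "aaa", sep = ",", direction = "long")
  tab <- table(aaa = x$aaa, vvv = x$vvv)
  print(tab)
}
```

Splitting both columns in a single cSplit call is avoided on purpose: the point of the two passes is that each number is repeated for every letter in its original row.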