 

Find frequencies of combinations where the data.frame needs to be parsed

I'm sure there's a simple solution to this, but I can't figure it out!! Suppose I have a dataframe that has the following information:

aaa<-c("A,B","B,C","B,D,E")
vvv<-c("101","101,102","102,103,104")
data_h<-data.frame(aaa,vvv)
data_h
    aaa         vvv
1   A,B         101
2   B,C     101,102
3 B,D,E 102,103,104

Desired output is a frequency map of individual hits, for subsequent analysis in a heat map. So:

  101   102   103   104
A  1     0     0     0
B  2     2     1     1
C  1     1     0     0
D  0     1     1     1
E  0     1     1     1

How do I make this transformation? I've seen many similar examples, but none where the contents of the data-frame need to be parsed.

The goal is to ultimately use heatmap or something similar on the output table to visualize the correlation between "aaa" and "vvv".

Amit Kohli asked Dec 19 '22 14:12


2 Answers

Here is a base R solution in 4 lines of code. First we define a function, spl, which splits a comma-separated string into a vector of its fields. eg takes two such strings, applies spl to each, and builds a grid of every combination of the resulting fields. Finally, we apply eg to each row of data_h, rbind the results together, and tabulate them with xtabs:

spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))
xtabs(data = dd)

The result is:

   vvv
aaa 101 102 103 104
  A   1   0   0   0
  B   2   2   1   1
  C   1   1   0   0
  D   0   1   1   1
  E   0   1   1   1
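To see what the two helpers contribute before tabulation, here is a minimal sketch that applies them to a single row. It re-declares spl and eg from above so it runs on its own; the variable names fields and row_pairs are only for illustration.

```r
# Helpers as defined in the answer above
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))

# Splitting one cell yields a character vector of its fields
fields <- spl("101,102")
print(fields)

# One row of data_h (here the second) expands to every aaa/vvv pairing
row_pairs <- eg("B,C", "101,102")
print(row_pairs)
```

Each row of data_h therefore contributes length(aaa fields) times length(vvv fields) rows to dd, which is what xtabs then counts.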

dcast. Alternatively, replace the last line of code above (the one with xtabs) with:

library(reshape2)
dcast(dd, aaa ~ vvv, fun.aggregate = length, value.var = "vvv")

in which case the result is:

  aaa 101 102 103 104
1   A   1   0   0   0
2   B   2   2   1   1
3   C   1   1   0   0
4   D   0   1   1   1
5   E   0   1   1   1

tapply. Another alternative would be tapply (however, it will fill in empty cells with NA rather than 0):

tapply(1:nrow(dd), dd, length)
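If zeros are wanted instead of NA, the empty cells can be filled in afterwards. The following is a sketch that rebuilds dd so it runs on its own; the name counts is introduced here for illustration. Newer R versions also offer a default argument to tapply, which is not relied on here.

```r
# Rebuild dd as in the answer above so this block is self-contained
data_h <- data.frame(aaa = c("A,B", "B,C", "B,D,E"),
                     vvv = c("101", "101,102", "102,103,104"))
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))

# tapply leaves combinations that never occur as NA
counts <- tapply(seq_len(nrow(dd)), dd, length)

# Replace those NAs with 0 so the result matches the xtabs table
counts[is.na(counts)] <- 0L
print(counts)
```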


G. Grothendieck answered Dec 28 '22 07:12


The shape of the data.frame suggests using the splitstackshape package. I don't know this package very well, so I only use it to reshape the data and then compute the frequencies by hand with table:

library(splitstackshape)
data_h_split <- concat.split.multiple(data_h,1:2)

#   aaa_1 aaa_2 aaa_3 vvv_1 vvv_2 vvv_3
# 1     A     B  <NA>   101    NA    NA
# 2     B     C  <NA>   101   102    NA
# 3     B     D     E   102   103   104

Once the data is in this format (regular columns, no commas), it is easy to compute frequencies with table (tapply or reshape would also work):

table(cbind.data.frame(ff= unlist(data_h_split[1:3]),
                       xx= unlist(data_h_split[4:6])))
   xx
ff  101 102 103 104
  A   1   0   0   0
  B   1   1   0   0
  C   0   1   0   0
  D   0   0   1   0
      0   0   0   0
  E   0   0   0   1
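These counts differ from the desired output because unlist pairs the split columns by position (aaa_1 with vvv_1, aaa_2 with vvv_2, and so on) instead of crossing every aaa value with every vvv value in a row. A small base R sketch for the third row shows the difference; the names aaa3, vvv3, positional and crossed are made up for this illustration.

```r
# Third row of data_h, already split into its fields
aaa3 <- c("B", "D", "E")
vvv3 <- c("102", "103", "104")

# Positional pairing: 3 pairs, one per column position
positional <- data.frame(ff = aaa3, xx = vvv3)
print(table(positional))

# Full crossing: 9 pairs, every aaa value with every vvv value
crossed <- expand.grid(ff = aaa3, xx = vvv3)
print(table(crossed))
```

Only the crossed version counts B with 103 and 104, which is what the desired output requires.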

Ananda's edit

Here's a multi-step approach that gets the desired result with "splitstackshape".

library(splitstackshape)

## Split the "vvv" column first, and reshape at the same time
x <- concat.split.multiple(data_h, split.cols="vvv", ",", "long")

## Add an ID column
x$id <- 1:nrow(x)

## Split the "aaa" column next, again reshaping as we do so
x <- concat.split.multiple(x[complete.cases(x), ], split.cols="aaa", ",", "long")

## Use `table` with `droplevels`
with(droplevels(x), table(aaa, vvv))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1
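
For the heat map mentioned in the question, a table like the one above can be converted to a plain numeric matrix and passed to base R's heatmap. This is only a sketch: the counts are rebuilt in base R so the block runs without splitstackshape, and turning off clustering and scaling is an assumption about the desired plot, not something stated in the thread.

```r
# Rebuild the frequency table in base R (same counts as the output above)
data_h <- data.frame(aaa = c("A,B", "B,C", "B,D,E"),
                     vvv = c("101", "101,102", "102,103,104"))
spl <- function(x) strsplit(as.character(x), ",")[[1]]
pairs_by_row <- Map(function(a, v) expand.grid(aaa = spl(a), vvv = spl(v)),
                    data_h$aaa, data_h$vvv)
tab <- table(do.call("rbind", pairs_by_row))

# heatmap() expects a numeric matrix, so drop the table class
m <- matrix(as.numeric(tab), nrow = nrow(tab), dimnames = dimnames(tab))

# Rowv/Colv = NA disables clustering; scale = "none" plots the raw counts
heatmap(m, Rowv = NA, Colv = NA, scale = "none",
        xlab = "vvv", ylab = "aaa")
```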
agstudy answered Dec 28 '22 06:12