I have a list in which each list item is a word frequency table derived from using "table()" on a different sample text. Each table is, therefore, a different length. I want to now convert the list into a single data frame in which each column is a word each row is a sample text. Here is a dummy example of my data:
t1<-table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
t2<-table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))
t3<-table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))
myList <- list(t1, t2, t3)
so, one gets this sort of structure:
> class(myList[[3]])
[1] "table"
> myList[[3]]
ask can country do for not what you your
2 2 2 2 2 2 1 2 2 2
I now need to convert this list (myList) into a single data frame. I thought I could do this with plyr, along the lines of what is done here (http://ryouready.wordpress.com/2009/01/23/r-combining-vectors-or-data-frames-of-unequal-length-into-one-data-frame/), e.g.
library(plyr)
l <- myList
do.call(rbind.fill, l)
But it seems that my "table" objects do not play nice. I tried converting them to dfs and also to vectors, but none of that worked quite right.
1. zoo. The zoo package has a multiway merge function which can do this compactly. The lapply
converts each component of myList
to a zoo object and then we simply merge them all:
# optionally add nice names to the list
names(myList) <- paste("t", seq_along(myList), sep = "")
library(zoo)
fz <- function(x)with(as.data.frame(x, stringsAsFactors=FALSE), zoo(Freq, Var1)))
out <- do.call(merge, lapply(myList, fz))
The above returns a multivariate zoo series in which the "times" are "a"
, "ago"
, etc. but if a data frame result were desired then its just a matter of as.data.frame(out)
.
2. Reduce. Here is a second solution. It uses Reduce
in the core of R.
merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
out <- Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))
# optionally add nice names
colnames(out)[-1] <- paste("t", seq_along(myList), sep = "")
3. xtabs. This one adds names to the list and then extracts the frequencies, names and groups as one long vector each putting them back together using xtabs
:
names(myList) <- paste("t", seq_along(myList))
xtabs(Freq ~ Names + Group, data.frame(
Freq = unlist(lapply(myList, unname)),
Names = unlist(lapply(myList, names)),
Group = rep(names(myList), sapply(myList, length))
))
Benchmark
Benchmarking some of the solutions using the rbenchmark package we get the following which indicates that the zoo solution is the fastest on the sample data and is arguably the simplest as well.
> t1<-table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
> t2<-table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))
> t3<-table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))
> myList <- list(t1, t2, t3)
>
> library(rbenchmark)
> library(zoo)
> names(myList) <- paste("t", seq_along(myList), sep = "")
>
> benchmark(xtabs = {
+ names(myList) <- paste("t", seq_along(myList))
+ xtabs(Freq ~ Names + Group, data.frame(
+ Freq = unlist(lapply(myList, unname)),
+ Names = unlist(lapply(myList, names)),
+ Group = rep(names(myList), sapply(myList, length))
+ ))
+ },
+ zoo = {
+ fz <- function(x) with(as.data.frame(x, stringsAsFactors=FALSE), zoo(Freq, Var1))
+ do.call(merge, lapply(myList, fz))
+ },
+ Reduce = {
+ merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
+ Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))
+ },
+ reshape = {
+ freqs.list <- mapply(data.frame,Words=seq_along(myList),myList,SIMPLIFY=FALSE,MoreArgs=list(stringsAsFactors=FALSE))
+ freqs.df <- do.call(rbind,freqs.list)
+ reshape(freqs.df,timevar="Words",idvar="Var1",direction="wide")
+ }, replications = 10, order = "relative", columns = c("test", "replications", "relative"))
test replications relative
2 zoo 10 1.000000
4 reshape 10 1.090909
1 xtabs 10 1.272727
3 Reduce 10 1.272727
ADDED: second solution.
ADDED: third solution.
ADDED: benchmark.
freqs.list <- mapply(data.frame,Words=seq_along(myList),myList,SIMPLIFY=FALSE,MoreArgs=list(stringsAsFactors=FALSE))
freqs.df <- do.call(rbind,freqs.list)
res <- reshape(freqs.df,timevar="Words",idvar="Var1",direction="wide")
head(res)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With