
Crosstabs with data.table in R [duplicate]

I love the data.table package in R, and I think it could help me perform sophisticated cross-tabulation tasks, but I haven't figured out how to use it for tasks similar to what table() does.

Here's some replication survey data:

opinion <- c("gov", "market", "gov", "gov")
ID <- c("resp1", "resp2", "resp3", "resp4")
party <- c("GOP", "GOP", "democrat", "GOP")

df <- data.frame(ID, opinion, party)

With base R's table(), counting the number of opinions by party is as simple as table(df$opinion, df$party).
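As a small base-R sketch of that call, assuming the replication data above: the `dnn` labels and the `xtabs()` variant are additions for illustration and are not part of the original question.

```r
# Toy survey data from the question
opinion <- c("gov", "market", "gov", "gov")
ID <- c("resp1", "resp2", "resp3", "resp4")
party <- c("GOP", "GOP", "democrat", "GOP")
df <- data.frame(ID, opinion, party)

# Two-way contingency table: opinions in rows, parties in columns
tab <- table(df$opinion, df$party, dnn = c("opinion", "party"))
print(tab)

# The formula interface gives the same counts
tab2 <- xtabs(~ opinion + party, data = df)

# Look up a single cell by its row and column names
tab["gov", "GOP"]
```

The order of the rows and columns follows R's sort order for the labels, which can depend on the locale.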

I've managed to do something similar in data.table, but the result is clunky and it adds a separate column.

library(data.table)
dt <- data.table(df)
dt[, .N, by = "party"]

There are a number of grouping operations in data.table that could be great for fast and sophisticated crosstabs of survey data, but I haven't found any tutorials on how to do it. Thanks for any help.

tom asked Oct 04 '15 15:10




1 Answer

We can use dcast from data.table (See the Efficient reshaping using data.tables vignette on the project wiki or on the CRAN project page).

dcast(dt, opinion~party, value.var='ID', length)
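A self-contained sketch of the same call on the question's toy data, assuming the data.table package is installed. Here the aggregation function is passed by name as `fun.aggregate = length`, which should be equivalent to the positional form above.

```r
library(data.table)

# Toy survey data from the question, built directly as a data.table
dt <- data.table(
  ID      = c("resp1", "resp2", "resp3", "resp4"),
  opinion = c("gov", "market", "gov", "gov"),
  party   = c("GOP", "GOP", "democrat", "GOP")
)

# One row per opinion, one column per party; each cell counts the IDs
wide <- dcast(dt, opinion ~ party, value.var = "ID", fun.aggregate = length)
print(wide)

# Combinations with no respondents (here market/democrat) come out as 0
wide[opinion == "market", democrat]

# Individual cells can be pulled out with ordinary data.table syntax
wide[opinion == "gov", GOP]
```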

Benchmarks

Using a somewhat bigger dataset, we can compare the speed of dcast from reshape2 with dcast from data.table, and with a plain grouped count:

library(data.table)

set.seed(24)
df <- data.frame(ID = 1:1e6, opinion = sample(letters, 1e6, replace = TRUE),
  party = sample(1:9, 1e6, replace = TRUE))

# 1. reshape2's dcast on a plain data.frame
system.time(reshape2::dcast(df, opinion ~ party, value.var = 'ID', length))
#   user  system elapsed
#  0.278   0.013   0.293

# 2. data.table's dcast; setDT() converts df to a data.table by reference
system.time(dcast(setDT(df), opinion ~ party, value.var = 'ID', length))
#   user  system elapsed
#  0.022   0.000   0.023

# 3. grouped count with .N (result is in 'long' format)
system.time(setDT(df)[, .N, by = .(opinion, party)])
#   user  system elapsed
#  0.018   0.001   0.018

The third option is slightly faster, but its result is in 'long' format (one row per opinion/party combination). If the OP wants a 'wide' format, the data.table dcast can be used.
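The two approaches can also be combined. The following is a minimal sketch, again on the question's toy data rather than the benchmark data, that counts in long format first and then reshapes the counts to wide. The `fill = 0L` argument is an addition here, so that combinations with no respondents show as 0 rather than NA.

```r
library(data.table)

# Toy survey data from the question
dt <- data.table(
  ID      = c("resp1", "resp2", "resp3", "resp4"),
  opinion = c("gov", "market", "gov", "gov"),
  party   = c("GOP", "GOP", "democrat", "GOP")
)

# Long format: one row per observed opinion/party combination
long <- dt[, .N, by = .(opinion, party)]
print(long)

# Wide format: spread the N column across parties, filling gaps with 0
wide <- dcast(long, opinion ~ party, value.var = "N", fill = 0L)
print(wide)
```

Note that the long result only contains combinations that actually occur in the data, so it has three rows here, not four.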

NOTE: I am using the devel version, i.e. v1.9.7, but the CRAN version should be fast enough.

akrun answered Oct 22 '22 23:10