I love the data.table package in R, and I think it could help me perform sophisticated cross-tabulation tasks, but I haven't figured out how to use the package to do tasks similar to table().
Here's some replication survey data:
opinion <- c("gov", "market", "gov", "gov")
ID <- c("resp1", "resp2", "resp3", "resp4")
party <- c("GOP", "GOP", "democrat", "GOP")
df <- data.frame(ID, opinion, party)
With table(), counting the number of opinions by party is as simple as table(df$opinion, df$party).
I've managed to do something similar in data.table, but the result is clunky and it adds a separate column.
library(data.table)
dt <- data.table(df)
dt[, .N, by="party"]
There are a number of grouping operations in data.table that could be great for fast and sophisticated crosstabs of survey data, but I haven't found any tutorials on how to do it. Thanks for any help.
Two-way tables are also known as contingency, cross-tabulation, or crosstab tables. The levels of one categorical variable are entered as the rows in the table and the levels of the other categorical variable are entered as the columns in the table.
For a precise reference, a cross-tabulation is a two- (or more) dimensional table that records the number (frequency) of respondents that have the specific characteristics described in the cells of the table. Cross-tabulation tables provide a wealth of information about the relationship between the variables.
We can use dcast from data.table (see the "Efficient reshaping using data.tables" vignette on the project wiki or on the CRAN project page).
dcast(dt, opinion~party, value.var='ID', length)
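As a minimal sketch of what that call returns on the four-row example from the question (the data is rebuilt here so the snippet runs on its own, and the name wide is only illustrative), the aggregation function can also be passed by name through fun.aggregate:

```r
library(data.table)

# Rebuild the small survey example from the question
dt <- data.table(
  ID      = c("resp1", "resp2", "resp3", "resp4"),
  opinion = c("gov", "market", "gov", "gov"),
  party   = c("GOP", "GOP", "democrat", "GOP")
)

# One row per opinion, one column per party; each cell counts the IDs
# that fall into that opinion/party combination
wide <- dcast(dt, opinion ~ party, value.var = "ID", fun.aggregate = length)
print(wide)
```

The cell counts should agree with table(dt$opinion, dt$party); combinations that never occur (here market/democrat) come back as 0.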
If we use a slightly bigger dataset, we can compare the speed of dcast from reshape2 and from data.table:
set.seed(24)
df <- data.frame(ID=1:1e6, opinion=sample(letters, 1e6, replace=TRUE),
party= sample(1:9, 1e6, replace=TRUE))
system.time(reshape2::dcast(df, opinion ~ party, value.var='ID', length))
# user system elapsed
# 0.278 0.013 0.293
system.time(dcast(setDT(df), opinion ~ party, value.var='ID', length))
# user system elapsed
# 0.022 0.000 0.023
system.time(setDT(df)[, .N, by = .(opinion, party)])
# user system elapsed
# 0.018 0.001 0.018
The third option is slightly faster, but its result is in 'long' format. If the OP wants a 'wide' format, the data.table dcast can be used.
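To make the long/wide distinction concrete, here is a hedged sketch on the small example from the question: it counts with .N first and then spreads the counts with dcast. The object names long and wide_from_long are illustrative, and fill = 0L is my choice for showing absent combinations as zero.

```r
library(data.table)

# Rebuild the small survey example from the question
dt <- data.table(
  ID      = c("resp1", "resp2", "resp3", "resp4"),
  opinion = c("gov", "market", "gov", "gov"),
  party   = c("GOP", "GOP", "democrat", "GOP")
)

# Long format: one row per (opinion, party) combination that actually occurs
long <- dt[, .N, by = .(opinion, party)]

# Wide format: party values become columns; combinations that are
# missing from the long table are filled with 0
wide_from_long <- dcast(long, opinion ~ party, value.var = "N", fill = 0L)
print(long)
print(wide_from_long)
```

The long table is convenient for further grouped computation, while the wide table reads like the output of table().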
NOTE: I am using the devel version, i.e. v1.9.7, but the CRAN version should be fast enough.