I have a dataset with about 3 million rows and the following structure:
PatientID | Year | PrimaryConditionGroup
-----------------------------------------
1         | Y1   | TRAUMA
1         | Y1   | PREGNANCY
2         | Y2   | SEIZURE
3         | Y1   | TRAUMA
Being fairly new to R, I have some trouble finding the right way to reshape the data into the structure outlined below:
PatientID | Year | TRAUMA | PREGNANCY | SEIZURE
-----------------------------------------------
1         | Y1   | 1      | 1         | 0
2         | Y2   | 0      | 0         | 1
3         | Y1   | 1      | 0         | 0
My question is: what is the fastest/most elegant way to create a data.frame where the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurrences)?
There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=T)
library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))
dt[ , list(TRAUMA    = sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE   = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
# PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,] 1 Y1 1 1 0
# [2,] 2 Y2 0 0 1
# [3,] 3 Y1 1 0 0
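If you'd rather not hard-code the condition names, a more general variant is the one-liner below. It's a sketch that assumes PrimaryConditionGroup is a factor, so that table() reports the same set of levels (with zeros) in every group:

# One count column per factor level, without naming the conditions explicitly
dt[ , as.list(table(PrimaryConditionGroup)), by = list(PatientID, Year)]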
EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that aggregate() packs the counts into a matrix column rather than ordinary data.frame columns; the second line below flattens it.)
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
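One caveat: as of R 4.0, read.table() no longer converts strings to factors by default, so table() may not see the same levels in every group. A minimal guard, assuming a modern R session (it's a no-op on older versions):

# Coerce to factor so table() reports every condition (with zeros) in each group
df$PrimaryConditionGroup <- factor(df$PrimaryConditionGroup)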
2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)
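For reference, the same pivot is a one-liner in reshape's successor, reshape2; this sketch reuses the mdf built above:

library(reshape2)
# value holds the condition names; length() counts occurrences per cell
dcast(mdf, PatientID + Year ~ value, fun.aggregate=length)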
There are fast melt and dcast methods specific to data.table, implemented in C, in versions >= 1.9.0; see the package's NEWS entry for details. Here's a comparison with other excellent answers from @Josh's post on 3-million-row data (excluding base::aggregate, as it was taking quite some time).
I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.
require(data.table) ## >= 1.9.0
require(reshape2)
set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")
# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))
DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID + Year ~ PrimaryConditionGroup,
                     fun.aggregate=length)
}
reshape2_dcast <- function(DT) {
    reshape2::dcast(DT, PatientID + Year ~ PrimaryConditionGroup,
                    fun.aggregate=length)
}
DT_raw <- function(DT) {
    DT[ , list(TRAUMA    = sum(PrimaryConditionGroup=="TRAUMA"),
               PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
               SEIZURE   = sum(PrimaryConditionGroup=="SEIZURE")),
        by = list(PatientID, Year)]
}
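The exact timing calls aren't shown here; a minimal harness along these lines (an assumption, not the original benchmark script) yields numbers like the table below:

# Run each method on a fresh copy of DT to rule out by-reference side effects
for (f in list(DT_dcast, reshape2_dcast, DT_raw))
    print(system.time(f(copy(DT))))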
# system.time(.) timed 3 times
#          Method Time_rep1 Time_rep2 Time_rep3
#        DT_dcast     0.393     0.399     0.396
#  reshape2_dcast     3.784     3.457     3.605
#          DT_raw     0.647     0.680     0.657
dcast.data.table is about 1.6x faster than normal aggregation using data.table, and about 8.8x faster than reshape2::dcast.