I am new to R and can't figure out what I might be doing wrong in the code below and how I could speed it up. I have a dataset and would like to add a column containing average value calculated from two column of data. Please take a look at the code below (WARNING: it could take some time to read my question but the code runs fine in R):
first let me define a dataset df (again I apologize for the long description of the code)
> df<-data.frame(prediction=sample(c(0,1),10,TRUE),subject=sample(c("car","dog","man","tree","book"),10,TRUE))
> df
prediction subject
1 0 man
2 1 dog
3 0 man
4 1 tree
5 1 car
6 1 tree
7 1 dog
8 0 tree
9 1 tree
10 1 tree
Next I add a the new column called subjectRate to df
df$subjectRate <- with(df,ave(prediction,subject))
> df
prediction subject subjectRate
1 0 man 0.0
2 1 dog 1.0
3 0 man 0.0
4 1 tree 0.8
5 1 car 1.0
6 1 tree 0.8
7 1 dog 1.0
8 0 tree 0.8
9 1 tree 0.8
10 1 tree 0.8
from the new table definition I generate a rateMap so as to automatically fill in new data with the subjectRate column initialized with the previously obtained average.
rateMap <- df[!duplicated(df[, c("subjectRate")]), c("subject","subjectRate")]
> rateMap
subject subjectRate
1 man 0.0
2 dog 1.0
4 tree 0.8
Now I am defining a new dataset with a combination of the old subject in df and new subjects
> dfNew<-data.frame(prediction=sample(c(0,1),15,TRUE),subject=sample(c("car","dog","man","cat","book","computer"),15,TRUE))
> dfNew
prediction subject
1 1 man
2 0 cat
3 1 computer
4 0 dog
5 0 book
6 1 cat
7 1 car
8 0 book
9 0 computer
10 1 dog
11 0 cat
12 0 book
13 1 dog
14 1 man
15 1 dog
My question: How do I create the third column efficiently? currently I am running the test below where I look up the subject rate in the map and input the value if found, or 0.5 if not.
> all_facts<-levels(factor(rateMap$subject))
> dfNew$subjectRate <- sapply(dfNew$subject,function(t) ifelse(t %in% all_facts,rateMap[as.character(rateMap$subject) == as.character(t),][1,"subjectRate"],0.5))
> dfNew
prediction subject subjectRate
1 1 man 0.0
2 0 cat 0.5
3 1 computer 0.5
4 0 dog 1.0
5 0 book 0.5
6 1 cat 0.5
7 1 car 0.5
8 0 book 0.5
9 0 computer 0.5
10 1 dog 1.0
11 0 cat 0.5
12 0 book 0.5
13 1 dog 1.0
14 1 man 0.0
15 1 dog 1.0
but with a real dataset (more than 200,000 rows) with multiple columns similar to subject to compute the average, the code takes a very long time to run. Can somebody suggest maybe a better way to do what I am trying to achieve? maybe some merge or something, but I am out of ideas. Thank you.
I suspect (but am not sure, since I haven't tested it) that this will be faster:
dfNew$subjectRate <- rateMap$subjectRate[match(dfNew$subject,rateMap$subject)]
since it mostly uses just indexing and match. It certainly a bit simpler, I think. This will fill in the "new" values with NAs, rather than 0.5, which can then be filled in however you like with,
dfNew$subjectRate[is.na(dfNew$subjectRate)] <- newValue
If the ave piece is particularly slow, the standard recommendation these days is to use the data.table package:
require(data.table)
dft <- as.data.table(df)
setkeyv(dft, "subject")
dft[, subjectRate := mean(prediction), by = subject]
and this will probably attract a few comments suggesting ways to eke a bit more speed out of that data table aggregation in the last line. Indeed, merging or joining using pure data.tables may be even slicker (and fast), so you might want to investigate that option as well. (See the very bottom of ?data.table for a bunch of examples.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With