I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where there is a consumption value in one row that is identical but negative to a consumption value in another row (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3){
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3}
I then use ddply to apply that function to remove these erroneous rows in the data
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
function(x){clean_negatives(x)})
proc.time() - ptm
I was hoping I could use data.table to make this go faster but I couldn't work out how to employ data.table to help.
With 1.3M rows, so far it is taking my desktop all day to compute and still hasn't finished.
Your question asks about data.table implementation. So, I've shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).
require(data.table)
# get the data.table in dt
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modifying for final output (line commented after Statquant's comment
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
Benchmarking
The data with million rows:
#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
Consumption=sample(-20:20, num_rows, replace=T),
StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
The data.table function:
FUN.DT <- function() {
require(data.table)
dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
dt <- dt[, sign := sign(sum(Neg)),
by = c("id", "StartDate", "EndDate", "Consumption")]
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
by=c("id", "StartDate", "EndDate", "Consumption")]
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
}
Your function with ddply
FUN.PLYR <- function() {
require(plyr)
clean_negatives <- function(x3) {
copies <- abs(sum(x3$Neg))
sgn <- ifelse(sum(x3$Neg) <0, -1, 1)
x3 <- x3[0:copies,]
x3$Consumption <- sgn*x3$Consumption
x3$Neg <- NULL
x3
}
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
function(x) clean_negatives(x))
}
Benchmarking with rbenchmark (with 1 run only)
require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
1 FUN.DT() 1 6.137 1.000 5.926 0.211 0 0
2 FUN.PLYR() 1 242.268 39.477 152.855 82.881 0 0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With