
Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.

df <- structure(list(ID = c(1045937900, 1045937900), 
SMS.Type = c("DF1", "WCB14"), 
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"), 
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")

I simply want to count the number of instances of SMS.Type and Reply.Date where the value is not null or empty. For the toy example above, that gives 2 for SMS.Type and 1 for Reply.Date.

I then want to add these to the data frame as total counts (I'm aware they will be duplicated across every row of the original data set, but that's OK).

I have been playing around with the aggregate and count functions, but to no avail:

mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
                  train, 
                  function(x) length(unique(which(!is.na(x)))))

mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
                  testtrain, 
                  function(x) length(which(!is.na(x))))

Can anyone help?

Thank you for your time

John Smith asked May 13 '15 10:05




2 Answers

Using data.table you could do the following (I've replaced the empty string in your original data with a real NA). I'm also not sure whether you are really looking for length(unique()) or just length.

library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") := 
                  lapply(.SD, function(x) length(unique(na.omit(x)))), 
                  .SDcols = cols, 
            by = ID]
#            ID SMS.Type         SMS.Date       Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900      DF1 12/02/2015 19:51               NA              2                1
# 2: 1045937900    WCB14 13/02/2015 08:38 13/02/2015 09:52              2                1

In the devel version (v >= 1.9.5) you could also use the uniqueN function.
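
A minimal sketch of the uniqueN variant, assuming a data.table release in which uniqueN accepts an na.rm argument (older devel builds may not). The data frame is the toy data from the question, with the empty Reply.Date entered as NA:

```r
library(data.table)

# Toy data from the question; the empty Reply.Date is entered as a real NA
df <- data.frame(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  SMS.Date   = c("12/02/2015 19:51", "13/02/2015 08:38"),
  Reply.Date = c(NA, "13/02/2015 09:52"),
  stringsAsFactors = FALSE
)

cols <- c("SMS.Type", "Reply.Date")

# uniqueN(x, na.rm = TRUE) counts distinct non-missing values,
# standing in for length(unique(na.omit(x)))
setDT(df)[, paste0(cols, ".count") := lapply(.SD, uniqueN, na.rm = TRUE),
          .SDcols = cols, by = ID]
```

If you want the plain number of non-missing rows rather than the number of distinct values, function(x) sum(!is.na(x)) can be passed to lapply instead.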


Explanation

This is a general solution which will work on any number of desired columns. All you need to do is to put the columns names into cols.

  1. lapply(.SD, ...) calls the given function on each of the columns specified in .SDcols = cols
  2. paste0(cols, ".count") creates the new column names by appending ".count" to the column names specified in cols
  3. := performs assignment by reference, meaning that the new columns are filled in place with the output of lapply(.SD, ...)
  4. the by argument specifies the grouping (aggregator) columns
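
The steps above can be looked at one piece at a time. This is only an illustrative sketch: the names dt, new_names and per_id are made up for the example, and SMS.Date is left out for brevity.

```r
library(data.table)

# Cut-down version of the toy data (dt is a hypothetical name)
dt <- data.table(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  Reply.Date = c(NA, "13/02/2015 09:52")
)
cols <- c("SMS.Type", "Reply.Date")

# Step 2: the new column names
new_names <- paste0(cols, ".count")
# c("SMS.Type.count", "Reply.Date.count")

# Steps 1 and 4 without :=, giving one summary row per ID and leaving dt unchanged
per_id <- dt[, lapply(.SD, function(x) length(unique(na.omit(x)))),
             .SDcols = cols, by = ID]

# Step 3: with :=, the same per-ID values are written into dt in place,
# repeated on every row of the group
dt[, (new_names) := lapply(.SD, function(x) length(unique(na.omit(x)))),
   .SDcols = cols, by = ID]
```

Without := the result is a separate aggregated table, which is closer to what aggregate returns; with := the counts end up as extra columns on the original rows, which is what the question asks for.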

David Arenburg answered Sep 24 '22 02:09


After converting your empty strings to NAs:

library(dplyr)
mutate(df, SMS.Type.count   = sum(!is.na(SMS.Type)),
           Reply.Date.count = sum(!is.na(Reply.Date)))
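
The conversion step is not shown above. One possible way to do it in base R, as a sketch that assumes every empty string in a character column should be treated as missing (is_chr is just a helper name for the example):

```r
# Toy data from the question, with the empty string still in Reply.Date
df <- data.frame(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  SMS.Date   = c("12/02/2015 19:51", "13/02/2015 08:38"),
  Reply.Date = c("", "13/02/2015 09:52"),
  stringsAsFactors = FALSE
)

# Replace "" with NA in every character column
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[x %in% ""] <- NA
  x
})

# Base R equivalent of the mutate() call: counts of non-missing values,
# recycled onto every row
df$SMS.Type.count   <- sum(!is.na(df$SMS.Type))
df$Reply.Date.count <- sum(!is.na(df$Reply.Date))
```

Note that these counts are taken over the whole data frame. In dplyr, adding group_by(ID) before mutate() would give counts per ID instead.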

user2987808 answered Sep 24 '22 02:09