I have a data.frame that I have constructed using a XML file, and now I want to count and sum its values, something like count and sum in SQL.
This is how the data.frame looks:
msgDataSource msgFileSource processDate msgNumRows
1 source1 Quarter 2015-01-30 30
2 source1 Month 2015-01-30 15
3 source1 Month 2015-01-30 20
4 source1 Year 2015-01-30 1
5 source2 Quarter 2015-01-30 30
6 source3 Quarter 2015-01-30 15
7 source1 Year 2015-02-01 80
8 source2 Year 2015-02-01 90
9 source1 Quarter 2015-02-01 5
10 source2 Quarter 2015-03-15 9
11 source3 Quarter 2015-03-15 14
This is what I need
processDate msgFileSource msgDataSource sumDataSources countDataSources
1: 2015-01-30 Month source1 35 2
2: 2015-01-30 Quarter source1 30 1
3: 2015-01-30 Quarter source2 30 1
4: 2015-01-30 Quarter source3 15 1
5: 2015-01-30 Year source1 1 1
6: 2015-02-01 Quarter source1 5 1
7: 2015-02-01 Year source1 80 1
8: 2015-02-01 Year source2 90 1
9: 2015-03-15 Quarter source2 9 1
10: 2015-03-15 Quarter source3 14 1
This is what I am able to get so far:
processDate msgFileSource msgDataSource sumDataSources
1: 2015-01-30 Month source1 35
2: 2015-01-30 Quarter source1 30
3: 2015-01-30 Quarter source2 30
4: 2015-01-30 Quarter source3 15
5: 2015-01-30 Year source1 1
6: 2015-02-01 Quarter source1 5
7: 2015-02-01 Year source1 80
8: 2015-02-01 Year source2 90
9: 2015-03-15 Quarter source2 9
10: 2015-03-15 Quarter source3 14
And this is my code:
dfFullData <- data.frame (
msgDataSource = c("source1", "source1", "source1", "source1", "source2", "source3", "source1", "source2", "source1", "source2", "source3"),
msgFileSource = c("Quarter", "Month", "Month", "Year", "Quarter", "Quarter", "Year", "Year", "Quarter", "Quarter", "Quarter"),
processDate = c("2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-01-30", "2015-02-01", "2015-02-01", "2015-02-01", "2015-03-15", "2015-03-15"),
msgNumRows = c(30, 15, 20, 1, 30, 15, 80, 90, 5, 9, 14),
stringsAsFactors=FALSE
)
summaryTable <- data.table(dfFullData)
summaryTable <- summaryTable[
order(processDate, msgFileSource, msgDataSource),
sum(msgNumRows),
by=list(processDate, msgFileSource, msgDataSource)
]
setnames(summaryTable, "V1", "sumDataSources")
print(summaryTable)
Is there a way to calculate the count on one pass, or should I count it separately and then perform an cbind?
How do I achieve what I need?
Thanks.
Use list
to make a list of the summary columns that you want in your aggregated data.table
. Use the in-built symbol .N
to find the number of rows in your subset:
summaryTable <- summaryTable[
order(processDate, msgFileSource, msgDataSource),
list(sumDataSources=sum(msgNumRows),
countDataSources=.N),
by=list(processDate, msgFileSource, msgDataSource) ]
Using list
this way will also mean that you don't need to use setnames
later, since you have already named your columns in your list
.
This is not relevant to the actual question but, as detailed by the comments below this answer, the use of an additional order
in the above command can be negated by using keyby
instead of by
. The final command will look like this:
summaryTable <- summaryTable[, list(sumDataSources=sum(msgNumRows),
countDataSources=.N),
keyby=list(processDate, msgFileSource, msgDataSource) ]
keyby
also has the additional benefit of setting it's arguments as the keys for the resulting table, the ordering of which is a by product from this process.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With