Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Widening a dataframe to get monthly sums of revenue for all unique values of catogorical columns in R

I have a df which has data like this:

sub = c("X001","X002", "X001","X003","X002","X001","X001","X003","X002","X003","X003","X002") 
month = c("201506", "201507", "201506","201507","201507","201508", "201508","201507","201508","201508", "201508", "201508") 
tech = c("mobile", "tablet", "PC","mobile","mobile","tablet", "PC","tablet","PC","PC", "mobile", "tablet") 
brand = c("apple", "samsung", "dell","apple","samsung","apple", "samsung","dell","samsung","dell", "dell", "dell")

revenue = c(20, 15, 10,25,20,20, 17,9,14,12, 9, 11)

df = data.frame(sub, month, brand, tech, revenue)

I want to use sub and month as key and get one row for every subscriber per month which displays the sum of revenues for the unique values in tech and brand for that subscriber for that month. This example is simplified and with less columns as I have a huge data set I decided to try doing it with data.table.

I have managed to do this for one catagorical column, either tech or brand using this:

df1 <- dcast(df, sub + month ~ tech,  fun=sum, value.var = "revenue")

but I want to do it for two or more caqtogorical columns, so far I've tried this:

df2 <- dcast(df, sub + month ~ tech+brand,  fun=sum, value.var = "revenue")

and it just concatenates the unique values of both catogorical columns and sums for that but I do not want that. I wan seperate columns for each unique value of all catogorical columns.

I'm new to R and would really appreciate any help.

like image 938
ljourney Avatar asked Nov 08 '16 08:11

ljourney


Video Answer


1 Answers

(I will assume that df is a data.table rather a data.frame like in your example.)

One possible solution for this is to first melt the data while keeping sub, month and revenue as keys. This way, brand and tech will be converted to a single variable with a value corresponding to each existing combination of the keys. This way we will be able to easily dcast it back as we will be operating against a single column- like in your first example

dcast(melt(df, c(1:2, 5)), sub + month ~ value, sum, value.var = "revenue")
#     sub  month PC apple dell mobile samsung tablet
# 1: X001 201506 10    20   10     20       0      0
# 2: X001 201508 17    20    0      0      17     20
# 3: X002 201507  0     0    0     20      35     15
# 4: X002 201508 14     0   11      0      14     11
# 5: X003 201507  0    25    9     25       0      9
# 6: X003 201508 12     0   21      9       0      0

As per OPs comment, you can easily add a prefix by adding also variable to the formula. This way, the column will be also ordered properly

dcast(melt(df, c(1:2, 5)), sub + month ~ variable + value, sum, value.var = "revenue")
#     sub  month brand_apple brand_dell brand_samsung tech_PC tech_mobile tech_tablet
# 1: X001 201506          20         10             0      10          20           0
# 2: X001 201508          20          0            17      17           0          20
# 3: X002 201507           0          0            35       0          20          15
# 4: X002 201508           0         11            14      14           0          11
# 5: X003 201507          25          9             0       0          25           9
# 6: X003 201508           0         21             0      12           9           0
like image 164
David Arenburg Avatar answered Nov 03 '22 01:11

David Arenburg