Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to structure data in summary format using R

I have created the below-mentioned data frame in R.

My_DF

ID        Date                  Type       Remark      Price
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1200
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_A      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_B      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_C      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_Y      1200
PRT-11    2020-12-07 10:12:14   SS_RT      AT_1_U      1600
PRT-11    2020-12-07 10:12:14   SS_RI      AT_1_M      1600
PRT-11    2020-12-07 10:12:14   SS_RO      AT_1_P      1600

I want to covert the above-mentioned DF in the following structure Dataframe and convert it in HTML format which can be used to send email using mailR library.

enter image description here

Where I have followed the following condition.

  • If Type is equal to SS_RT then it is Type - A

  • If Type is anything other than SS_RT then it is Type - B

  • If Type is equal to SS_RT with Remark is equal to AT_1_O then it is Type - A1

  • If Type is equal to SS_RT with Remark other than AT_1_O then it is Type - A2

  • The Formula for Type - A1 (Excl) is Type - A1 divided by the sum of Type - A1and Type - A2

  • The Formula for Type - A1 (Excl) is Type - A2 divided by the sum of Type - A1and Type - A2

Rest all the %age formula are pretty straightforward by Total in the denominator.

In the data frame, it is possible the there is no entry for a particular date. For that, we need to ensure that of all the available dates we need to take min and max date and ensure that for that date which is not available we show value as 0 in bother count and sum column.

I have merged the date in two rows the first row is used for the count and the second one is for the sum group by their logical defination.

like image 211
Viper Avatar asked Mar 18 '26 11:03

Viper


1 Answers

Here is a data.table solution. I tried to avoid manual calculations and though of a solution based on long to wide transformation. Here is my solution, with step by step detail after:

library(lubridate)
library(data.table)

dt <- setDT(dt)
dt[,Date := date(Date)]
dt[,type := fifelse(Type == "SS_RT",fifelse(Remark == "AT_1_O","A1","A2"),"B")]
## transform to wide
df2 <- rbind(dcast(data = dt,Date~type ,value.var = "Price",fill = 0)[,linetype := "count"],
             dcast(data = dt,Date~type ,value.var = "Price",fill = 0,fun.aggregate = sum)[,linetype := "value"])
## A and tot
df2[,tot := rowSums(.SD),.SDcols = c("A1","A2","B")]
df2[,A := A1+A2]
## create pc
cols <- c("A","A1","A2","B")
df2[,paste0(cols,"_pc") := lapply(.SD,function(x) round(x/tot*100) ),.SDcols = cols]
cols <- c("A1","A2")
df2[,paste0(cols,"_exc") := lapply(.SD,function(x) round(x/(A1+A2)*100) ),.SDcols = cols]
## add missing dates
df2 <- merge(CJ(Date = seq(min(dt$Date),max(dt$Date),1),linetype = c("count","value")),
             df2,all = T,by = c("Date","linetype"))

df2[is.na(df2)] <- 0
df2[,linetype := NULL]
df2

          Date   A1   A2    B  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
 1: 2020-12-01    3    1    3    7    4   57    43    14   43     75     25
 2: 2020-12-01 3800 1200 4200 9200 5000   54    41    13   46     76     24
 3: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 4: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 5: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 6: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 7: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 8: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 9: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
10: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
11: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
12: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
13: 2020-12-07    0    1    2    3    1   33     0    33   67      0    100
14: 2020-12-07    0 1600 3200 4800 1600   33     0    33   67      0    100

So first step is I create the type variable following your rule:

dt[,Date := date(Date)]
dt[,type := fifelse(Type == "SS_RT",fifelse(Remark == "AT_1_O","A1","A2"),"B")]

We know A is just A1 + A2. It allows me to transform the table to wide format. I do it twice: once to count, once to make the sum per type:

dcast(data = dt,Date ~ type ,value.var = "Price",fill = 0)

         Date A1 A2 B 
1: 2020-12-01  3  1 3    
2: 2020-12-07  0  1 2    

Here I count the number of occurrence for each type, because it uses the default aggregate: lenght. If I use sum as aggregate function:

dcast(data = dt,Date~type ,value.var = "Price",fill = 0,fun.aggregate = sum)

         Date   A1   A2    B
1: 2020-12-01 3800 1200 4200
2: 2020-12-07    0 1600 3200

I add the linetype variable, which will help me after to add the missing dates (I use it to keep two lines per dates).

I bind the two, I obtain:

         Date   A1   A2    B linetype
1: 2020-12-01    3    1    3    count
2: 2020-12-07    0    1    2    count
3: 2020-12-01 3800 1200 4200    value
4: 2020-12-07    0 1600 3200    value

I then calculate A and the total:

df2[,tot := rowSums(.SD),.SDcols = c("A1","A2","B")]
df2[,A := A1+A2]

I then calculate the percentage (_pc) and Excl variables (that I names _exc for simplicity), using lapply and a vector of the column I want to transform. I use fifelse to avoid dividing by 0:

cols <- c("A","A1","A2","B")
df2[,paste0(cols,"_pc") := lapply(.SD,function(x) round(x/tot*100) ),.SDcols = cols]
cols <- c("A1","A2")
df2[,paste0(cols,"_exc") := lapply(.SD,function(x) round(x/(A1+A2)*100) ),.SDcols = cols]


         Date   A1   A2    B linetype  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
1: 2020-12-01    3    1    3    count    7    4   57    43    14   43     75     25
2: 2020-12-01 3800 1200 4200    value 9200 5000   54    41    13   46     76     24
3: 2020-12-07    0    1    2    count    3    1   33     0    33   67      0    100
4: 2020-12-07    0 1600 3200    value 4800 1600   33     0    33   67      0    100

I then add the missing dates, by merging with all combination of linetype and Date and keeping all rows. I use the CJ function to create a data.table with all combinations of the two variables:

CJ(Date = seq(min(dt$Date),max(dt$Date),1),linetype = c("count","value"))
          Date linetype
 1: 2020-12-01    count
 2: 2020-12-01    value
 3: 2020-12-02    count
 4: 2020-12-02    value
 5: 2020-12-03    count
 6: 2020-12-03    value
 7: 2020-12-04    count
 8: 2020-12-04    value
 9: 2020-12-05    count
10: 2020-12-05    value
11: 2020-12-06    count
12: 2020-12-06    value
13: 2020-12-07    count
14: 2020-12-07    value

And then replace missing values with 0 and supress the linetype variable.

You can then reorder the columns with setcolorder, and use kabbleExtra (see here) to produce your html output.

You can do the same with dplyr, using pivot_wider to transform to wide, mutate_all instead of the lapply(.SD,...) to do your calculations, expand.grid instead of CJ to generate the table of missing dates.

like image 131
denis Avatar answered Mar 21 '26 02:03

denis



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!