Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to calculate unique count using dcast in R

Tags:

r

transpose

dcast

I'm using dcast to transpose the following table

date               event          user_id
25-07-2020         Create          3455
25-07-2020         Visit           3567
25-07-2020         Visit           3567
25-07-2020         Add             3567
25-07-2020         Add             3678
25-07-2020         Add             3678
25-07-2020         Create          3567
24-07-2020         Edit            3871

I'm using dcast to transpose to have my events as columns and count user_id

dae_summ <- dcast(ahoy_events, date ~ event, value.var="user_id")

But I'm not getting unique user id's. its counting the same user_id multiple times. What can I do to get one user_id counted only one time for the same date and event.

like image 963
Abhi Avatar asked Jul 25 '20 19:07

Abhi


4 Answers

We could use uniqueN from data.table

library(data.table)
dcast(setDT(ahoy_events), date ~ event, fun.aggregate = uniqueN)
#         date Add Create Edit Visit
#1: 24-07-2020   0      0    1     0
#2: 25-07-2020   2      2    0     1

Or using pivot_wider from tidyr with values_fn specified as n_distinct

library(tidyr)
library(dplyr)
ahoy_events %>%
   pivot_wider(names_from = event, values_from = user_id, 
      values_fn = list(user_id = n_distinct), values_fill = list(user_id = 0))
# A tibble: 2 x 5
#   date       Create Visit   Add  Edit
#  <chr>       <int> <int> <int> <int>
#1 25-07-2020      2     1     2     0
#2 24-07-2020      0     0     0     1

data

ahoy_events <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
"25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
"Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
-8L))
like image 78
akrun Avatar answered Oct 26 '22 05:10

akrun


You can try:

library(reshape2)

#Data
df <- structure(list(date = c("25-07-2020", "25-07-2020", "25-07-2020", 
"25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c("Create", "Visit", "Visit", "Add", "Add", "Add", 
"Create", "Edit"), user_id = c(3455L, 3567L, 3567L, 3567L, 3678L, 
3678L, 3567L, 3871L)), class = "data.frame", row.names = c(NA, 
-8L))

#New code
dae_summ <- dcast(df, date ~ event,  value.var="user_id",fun.aggregate = function(x) length(unique(x)))

        date Add Create Edit Visit
1 24-07-2020   0      0    1     0
2 25-07-2020   2      2    0     1

Your code produces this:

        date Add Create Edit Visit
1 24-07-2020   0      0    1     0
2 25-07-2020   3      2    0     2

So there is a difference.

like image 43
Duck Avatar answered Oct 26 '22 04:10

Duck


Using the reshape2 package, you can utilize the following:

library(reshape2)

Data:

zz <- "date               event          user_id
       25-07-2020         Create          3455
       25-07-2020         Visit           3567
       25-07-2020         Visit           3567
       25-07-2020         Add             3567
       25-07-2020         Add             3678
       25-07-2020         Add             3678
       25-07-2020         Create          3567
       24-07-2020         Edit            3871"
data <- read.table(text=zz, header = TRUE)

Code:

data %>% 
  dcast(user_id ~ event, value.var="user_id",fun.aggregate = function(x) length(unique(x)))

Output:

date         Add     Create      Edit      Visit
<fctr>       <int>   <int>       <int>     <int>
24-07-2020   0       0           1         0
25-07-2020   2       2           0         1

Created on 2020-07-25 by the reprex package (v0.3.0)

like image 1
iamericfletcher Avatar answered Oct 26 '22 03:10

iamericfletcher


A base R option using reshape

out <- replace(
  u <- reshape(
    unique(transform(ahoy_events, user_id = ave(user_id, event, date, FUN = function(x) length(unique(x))))),
    direction = "wide",
    idvar = "date",
    timevar = "event"
  ),
  is.na(u),
  0
)

such that

> out
        date user_id.Create user_id.Visit user_id.Add user_id.Edit
1 25-07-2020              2             1           2            0
8 24-07-2020              0             0           0            1

data

  "25-07-2020", "25-07-2020", "25-07-2020",
  "25-07-2020", "25-07-2020", "25-07-2020", "25-07-2020", "24-07-2020"
), event = c(
  "Create", "Visit", "Visit", "Add", "Add", "Add",
  "Create", "Edit"
), user_id = c(
  3455L, 3567L, 3567L, 3567L, 3678L,
  3678L, 3567L, 3871L
)), class = "data.frame", row.names = c(
  NA,
  -8L
))
like image 1
ThomasIsCoding Avatar answered Oct 26 '22 05:10

ThomasIsCoding