Labeling unique values in R

My data look like:

data <- matrix(c("1","install","2015-10-23 14:07:20.000000",
                 "2","install","2015-10-23 14:08:20.000000",
                 "3","install","2015-10-23 14:07:25.000000",
                 "3","sale","2015-10-23 14:08:20.000000",
                 "4","install","2015-10-23 14:07:20.000000",
                 "4","sale","2015-10-23 14:09:20.000000",
                 "4","sale","2015-10-23 14:11:20.000000"),
               ncol=3, byrow=TRUE)
colnames(data) <- c("id","event","time")

I would like to add a fourth column, called label, that labels each row according to its id. In this case:

  • a "0" label if the id is unique
  • a "1" label if the id is not unique and has 1 associated sale
  • a "2" label if the id is not unique and has 2 associated sales

and so on up to n sales.

The final result should look like:

data1 <- matrix(c("1","install","2015-10-23 14:07:20.000000","0",
                  "2","install","2015-10-23 14:08:20.000000","0",
                  "3","install","2015-10-23 14:07:25.000000","1",
                  "3","sale","2015-10-23 14:08:20.000000","1",
                  "4","install","2015-10-23 14:07:20.000000","2",
                  "4","sale","2015-10-23 14:09:20.000000","2",
                  "4","sale","2015-10-23 14:11:20.000000","2"),
                 ncol=4, byrow=TRUE)

It's not clear to me what's the best approach in R to create "labels" based on conditions... maybe dplyr::mutate?

chopin_is_the_best asked Feb 08 '23 15:02


1 Answer

Updated to cover the "and so on up to n sales" requirement.

A dplyr option could be:

library(dplyr)
data <- as.data.frame(data)
data %>% 
  group_by(id) %>% 
  mutate(label = if(n() == 1) 0 else as.numeric(sum(event == "sale")))

#Source: local data frame [7 x 4]
#Groups: id [4]
#
#      id   event                       time label
#  (fctr)  (fctr)                     (fctr) (dbl)
#1      1 install 2015-10-23 14:07:20.000000     0
#2      2 install 2015-10-23 14:08:20.000000     0
#3      3 install 2015-10-23 14:07:25.000000     1
#4      3    sale 2015-10-23 14:08:20.000000     1
#5      4 install 2015-10-23 14:07:20.000000     2
#6      4    sale 2015-10-23 14:09:20.000000     2
#7      4    sale 2015-10-23 14:11:20.000000     2

The data.table equivalent would be:

library(data.table)
data <- as.data.table(data)  # or setDT(data) if it's already a data.frame
data[, label := if(.N == 1) 0 else as.numeric(sum(event == "sale")), by=id]
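
If neither package is available, the same grouping logic can probably be sketched in base R with ave(). This is only a sketch, not part of the original answer: the names events, n_rows and n_sales are made up for illustration, and it assumes the matrix from the question is rebuilt as a data frame first.

```r
# Rebuild the question's matrix and convert it to a data frame
data <- matrix(c("1","install","2015-10-23 14:07:20.000000",
                 "2","install","2015-10-23 14:08:20.000000",
                 "3","install","2015-10-23 14:07:25.000000",
                 "3","sale","2015-10-23 14:08:20.000000",
                 "4","install","2015-10-23 14:07:20.000000",
                 "4","sale","2015-10-23 14:09:20.000000",
                 "4","sale","2015-10-23 14:11:20.000000"),
               ncol = 3, byrow = TRUE)
colnames(data) <- c("id", "event", "time")
events <- as.data.frame(data, stringsAsFactors = FALSE)  # hypothetical name

# Number of rows per id, repeated on every row of that id
n_rows <- ave(seq_along(events$id), events$id, FUN = length)

# Number of "sale" events per id, repeated on every row of that id
n_sales <- ave(as.integer(events$event == "sale"), events$id, FUN = sum)

# 0 for ids that occur once, otherwise the sale count for that id
events$label <- ifelse(n_rows == 1, 0, n_sales)
```

As in the dplyr and data.table versions, the explicit row-count check only changes the result when an id has a single row and that row is itself a sale; for a lone install the sale count is already 0.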

talat answered Feb 11 '23 06:02