Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshaping data in R with "login" "logout" times

Tags:

r

tidyr

I'm new to R, and am working on a side project for my own purposes. I have this data (reproducable dput of this is at the end of the question):

     X            datetime  user  state
1    1 2016-02-19 19:13:26 User1 joined
2    2 2016-02-19 19:21:18 User2 joined
3    3 2016-02-19 19:21:33 User1 joined
4    4 2016-02-19 19:35:38 User1 joined
5    5 2016-02-19 19:44:15 User1 joined
6    6 2016-02-19 19:48:55 User1 joined
7    7 2016-02-19 19:52:40 User1 joined
8    8 2016-02-19 19:53:15 User3 joined
9    9 2016-02-19 20:02:34 User3 joined
10  10 2016-02-19 20:13:48 User3 joined
19 637 2016-02-19 19:13:32 User1   left
20 638 2016-02-19 19:25:26 User1   left
21 639 2016-02-19 19:30:30 User2   left
22 640 2016-02-19 19:42:16 User1   left
23 641 2016-02-19 19:47:59 User1   left
24 642 2016-02-19 19:51:06 User1   left
25 643 2016-02-19 20:02:26 User3   left 

I want it to look like this:

    user  joined                left
1   User1 2016-02-19 19:13:26   2016-02-19 19:13:32
2   User2 2016-02-19 19:21:18   2016-02-19 19:30:30
3   User3 2016-02-19 19:53:15   2016-02-19 20:02:26 
4   User1 2016-02-19 19:21:33   2016-02-19 19:25:26
.
.
.

I'm looking at tidyr as there's some reshaping involved obviously, but I can't wrap my head around what exactly needs to be done. Is this even possible (without looping/massive amounts of procedural code)? The problem I can't grasp how to get around is that there's no way to know that a particular "left" record should be joined to a particular "joined" record. Examples I can find all involve a static month or day over which other values are gathered. I should add that it's not necessarily guaranteed that all records are guaranteed to have a "left" value (a user might still be "joined").

Here's the dput output of a sample of the data:

> dput(samp)
structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 637L, 638L, 639L, 640L, 
641L, 642L, 643L, 644L, 645L, 646L, 647L, 648L, 649L, 650L, 651L
), datetime = structure(c(1L, 3L, 4L, 7L, 9L, 11L, 13L, 14L, 
16L, 18L, 21L, 22L, 23L, 26L, 27L, 30L, 32L, 33L, 2L, 5L, 6L, 
8L, 10L, 12L, 15L, 17L, 19L, 20L, 24L, 25L, 28L, 29L, 31L), .Label = c("2016-02-19 19:13:26", 
"2016-02-19 19:13:32", "2016-02-19 19:21:18", "2016-02-19 19:21:33", 
"2016-02-19 19:25:26", "2016-02-19 19:30:30", "2016-02-19 19:35:38", 
"2016-02-19 19:42:16", "2016-02-19 19:44:15", "2016-02-19 19:47:59", 
"2016-02-19 19:48:55", "2016-02-19 19:51:06", "2016-02-19 19:52:40", 
"2016-02-19 19:53:15", "2016-02-19 20:02:26", "2016-02-19 20:02:34", 
"2016-02-19 20:13:38", "2016-02-19 20:13:48", "2016-02-19 20:42:27", 
"2016-02-19 20:48:22", "2016-02-19 20:49:31", "2016-02-19 20:59:58", 
"2016-02-19 21:06:20", "2016-02-19 21:10:43", "2016-02-19 21:11:13", 
"2016-02-19 21:11:15", "2016-02-19 21:11:22", "2016-02-19 21:17:33", 
"2016-02-19 22:02:45", "2016-02-19 22:05:18", "2016-02-19 22:05:37", 
"2016-02-19 22:05:47", "2016-02-19 22:30:30"), class = "factor"), 
    user = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
    3L, 4L, 1L, 1L, 4L, 4L, 4L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 3L, 
    3L, 3L, 1L, 4L, 1L, 1L, 4L, 4L), .Label = c("User1", "User2", 
    "User3", "User4"), class = "factor"), state = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L), .Label = c("joined", "left"), class = "factor")), .Names = c("X", 
"datetime", "user", "state"), class = "data.frame", row.names = c(NA, 
-33L))
like image 833
Tim Coker Avatar asked Mar 11 '16 04:03

Tim Coker


4 Answers

Using rowid() from the data.table-package along with dcast:

require(data.table)
dcast(dt, user + rowid(user, state) ~ state, value.var="datetime")

#      user user_1              joined                left
#  1: User1      1 2016-02-19 19:13:26 2016-02-19 19:13:32
#  2: User1      2 2016-02-19 19:21:33 2016-02-19 19:25:26
#  3: User1      3 2016-02-19 19:35:38 2016-02-19 19:42:16
#  4: User1      4 2016-02-19 19:44:15 2016-02-19 19:47:59
#  5: User1      5 2016-02-19 19:48:55 2016-02-19 19:51:06
#  6: User1      6 2016-02-19 19:52:40                <NA>
#  7: User2      1 2016-02-19 19:21:18 2016-02-19 19:30:30
#  8: User3      1 2016-02-19 19:53:15 2016-02-19 20:02:26
#  9: User3      2 2016-02-19 20:02:34                <NA>
# 10: User3      3 2016-02-19 20:13:48                <NA>
like image 141
Arun Avatar answered Nov 12 '22 02:11

Arun


We can make use of the order of "left" and "joined", and match when one follows the other for each user.

For this I'm going to use library(data.table)

library(data.table)
setDT(df)

## order the data by user and datetime
df <- df[order(user, datetime)]
## add an 'order' column, which is a sequence from 1 to lenght()  
## for each user
df[, order := seq(1:.N), by=user]

## split the left and joins
dt_left <- df[state == "left"]
dt_joined <- df[state == "joined"]

## assuming 'left' is after 'joined', shift the 'order' back for left
dt_left[, order := order - 1]

## join user an dorder (and subsetting relevant columns) 
## keeping when there's a 'joined' but not a 'left'
dt <- dt_left[, .(user, order, datetime)][dt_joined[, .(user, order, datetime)], on=c("user", "order"), nomatch=NA]

## rename columns
setnames(dt, c("datetime", "i.datetime"), c("left", "joined"))

     user order                left              joined
 1: User1     1 2016-02-19 19:13:32 2016-02-19 19:13:26
 2: User1     3 2016-02-19 19:25:26 2016-02-19 19:21:33
 3: User1     5 2016-02-19 19:42:16 2016-02-19 19:35:38
 4: User1     7 2016-02-19 19:47:59 2016-02-19 19:44:15
 5: User1     9 2016-02-19 19:51:06 2016-02-19 19:48:55
 6: User1    11 2016-02-19 20:48:22 2016-02-19 19:52:40
 7: User1    13 2016-02-19 21:11:13 2016-02-19 21:06:20
 8: User1    15 2016-02-19 21:17:33 2016-02-19 21:11:15
 9: User2     1 2016-02-19 19:30:30 2016-02-19 19:21:18
10: User3     1 2016-02-19 20:02:26 2016-02-19 19:53:15
11: User3     3 2016-02-19 20:13:38 2016-02-19 20:02:34
12: User3     5 2016-02-19 20:42:27 2016-02-19 20:13:48
13: User3     7                  NA 2016-02-19 20:49:31
14: User3     8                  NA 2016-02-19 22:30:30
15: User4     1 2016-02-19 21:10:43 2016-02-19 20:59:58
16: User4     3 2016-02-19 22:02:45 2016-02-19 21:11:22
17: User4     5 2016-02-19 22:05:37 2016-02-19 22:05:18
18: User4     7                  NA 2016-02-19 22:05:47
like image 21
SymbolixAU Avatar answered Nov 12 '22 02:11

SymbolixAU


Base version:

samp$count <- with(samp, ave(as.character(user),list(state,user),FUN=seq_along) )

out <- merge(
  samp[samp$state=="joined",c("user","datetime","count")],
  samp[samp$state=="left",c("user","datetime","count")],
  by=c("user","count"), all.x=TRUE
)

out[order(out$count),]
like image 5
thelatemail Avatar answered Nov 12 '22 02:11

thelatemail


Another way to do it:

library(tidyr)
df <- df %>% spread(state, datetime)

df_joined <- df[!is.na(df$joined), 2:3]
df_joined <- df_joined[with(df_joined, order(user, joined)), ]

df_left <- df[!is.na(df$left), c(2, 4)]
df_left <- df_left[with(df_left, order(user, left)), ]

merge(df_joined, df_left, all = TRUE, by = 'user')
like image 3
ytk Avatar answered Nov 12 '22 02:11

ytk