Remove rows from data: overlapping time intervals?

Tags:

Edit: I'm looking for solution for this question now also with other programming languages.

Based on the other question I asked, I have a dataset like this (for R users, dput for this below) which represents user computer sessions:

Click to copy

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2     user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15    USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47
>

There may be several concurrent (overlapping based on time) sessions for the same username from the the same computer. How can I remove those rows so that only one sessions is left for this data? Original data set has approx. 500 000 rows.

The expected output is (rows 2, 15 removed)

Click to copy

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
>

Here is the dataset:

Click to copy

structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")

564

asked Jan 26 '12 11:01

jrara

2 Answers

Try the intervals package:

Click to copy

library(intervals)

f <- function(dd) with(dd, {
    r <- reduce(Intervals(cbind(start, end)))
    data.frame(username = username[1],
         machine = machine[1],
         start = structure(r[, 1], class = class(start)),
         end = structure(r[, 2], class = class(end)))
})

do.call("rbind", by(d, d[1:2], f))

With the sample data this reduces the 15 rows to the following 13 rows (by combining rows 1 and 2 and rows 12 and 13 in the original data frame):

Click to copy

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2     user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3     user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4     user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5     user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6     user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7     user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8     user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9     USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10    USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11    USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12    USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13    USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47

177

answered Sep 30 '22 06:09

G. Grothendieck

One solution is to first split the intervals, so that they are sometimes equal but never partially overlap, and them remove the duplicates. The problem is that we are left with many small abutting intervals, and merging them does not look straightforward.

Click to copy

library(reshape2)
library(sqldf)
d$machine <- as.character( d$machine ) # Duplicated levels...
ddply( d, c("username", "machine"), function (u) {
  # For each username and machine, 
  # compute all the possible non-overlapping intervals
  intervals <- sort(unique( c(u$start, u$end) ))
  intervals <- data.frame( 
    start = intervals[-length(intervals)], 
    end   = intervals[-1] 
  )
  # Only retain those actually in the data
  u <- sqldf( "
    SELECT DISTINCT u.username, u.machine, 
                    intervals.start, intervals.end
    FROM  u, intervals 
    WHERE       u.start <= intervals.start 
    AND   intervals.end <=         u.end
  " )
  # We have non-overlapping, but potentially abutting intervals:
  # ideally, we should merge them, but I do not see an easy 
  # way to do so.
  u
} )

EDIT: Another, conceptually cleaner solution, that fixes the non-merged abutting intervals problem, is to count the number of open sessions for each user and machine: when it stops being zero, the user has logged in (with one or more session), when it drops to zero, the user has closed all his/her sessions.

Click to copy

ddply( d, c("username", "machine"), function (u) {
  a <- rbind( 
    data.frame( time = min(u$start) - 1, sessions = 0 ),
    data.frame( time = u$start, sessions = 1 ), 
    data.frame( time = u$end,   sessions = -1 ) 
  )
  a <- a[ order(a$time), ]
  a$sessions <- cumsum(a$sessions)
  a$previous <- c( 0, a$sessions[ - nrow(a) ] )
  a <- a[ a$previous == 0 & a$sessions  > 0 | 
          a$previous  > 0 & a$sessions == 0, ]
  a$previous_time <- a$time
  a$previous_time[-1] <- a$time[ -nrow(a) ]
  a <- a[ a$previous > 0 & a$sessions == 0, ]
  a <- data.frame( 
    username = u$username[1],
    machine  = u$machine[1],
    start = a$previous_time,
    end   = a$time
  )
  a
} )

answered Sep 30 '22 06:09

Vincent Zoonekynd

Related questions
                            
                                Deploying Django on an apache server
                            
                                Greedy match with negative lookahead in a regular expression
                            
                                Why is putting the module level code into a function and then calling the function is faster in Python?
                            
                                Scapy: Adding new protocol with complex field groupings
                            
                                Python. Doing some work on background with Gtk GUI
                            
                                Clip an image using several patches in matplotlib
                            
                                How do you make the linewidth of a single line change as a function of x in matplotlib?
                            
                                How to pass class's self through a flask.Blueprint.route decorator?
                            
                                Python naming conventions for functions that do modify the object or return a modified copy
                            
                                matplotlib plt.show() only selected objects
                            
                                Connect python to oracle
                            
                                Inverted fancy indexing
                            
                                How to use Pageant with Paramiko on Windows?
                            
                                How to avoid the "This message may not have been sent by" warning when sending email using Google App Engine?
                            
                                Python: Catching the output from subprocess.call with stdout
                            
                                Scripting library for monitoring server health?
                            
                                Python - how can I change default path when installing modules?
                            
                                Does the inaccessible `.0` variable in `locals()` affect memory or performance?
                            
                                Encoding used for u"" literals
                            
                                Decorating a method that's already a classmethod?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Remove rows from data: overlapping time intervals?

Tags:

python

powershell

r

perl

jrara

People also ask

2 Answers

G. Grothendieck

Vincent Zoonekynd

Recent Activity

Donate For Us