Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove rows from data: overlapping time intervals?

Edit: I'm looking for solution for this question now also with other programming languages.

Based on the other question I asked, I have a dataset like this (for R users, dput for this below) which represents user computer sessions:

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2     user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15    USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47
>

There may be several concurrent (overlapping based on time) sessions for the same username from the the same computer. How can I remove those rows so that only one sessions is left for this data? Original data set has approx. 500 000 rows.

The expected output is (rows 2, 15 removed)

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
>

Here is the dataset:

structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")
like image 564
jrara Avatar asked Jan 26 '12 11:01

jrara


People also ask

How do you find overlapping time intervals?

Following is complete algorithm. 1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.

What is an overlapping interval?

Let's take the following overlapping intervals example to explain the idea: If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.

How do you find overlapping time intervals in Python?

The basic idea is 1) first take input_start to test_start (if both of them are not equal and input_start is min) 2) always take test_start and test_end 3) take test_end to input_end if test_end is less than input end (and end_input and end_test are not equal).

How do you find overlapping intervals in Java?

Java Program to find overlapping intervals among a given set of intervals. In this approach, we are going to take a count array and for each given interval start time we will do count[startTime]++ and for end time, do count[endTime]--. sum > 2 it means there is overlapping intervals.


2 Answers

Try the intervals package:

library(intervals)

f <- function(dd) with(dd, {
    r <- reduce(Intervals(cbind(start, end)))
    data.frame(username = username[1],
         machine = machine[1],
         start = structure(r[, 1], class = class(start)),
         end = structure(r[, 2], class = class(end)))
})

do.call("rbind", by(d, d[1:2], f))

With the sample data this reduces the 15 rows to the following 13 rows (by combining rows 1 and 2 and rows 12 and 13 in the original data frame):

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2     user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3     user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4     user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5     user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6     user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7     user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8     user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9     USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10    USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11    USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12    USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13    USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47
like image 177
G. Grothendieck Avatar answered Sep 30 '22 06:09

G. Grothendieck


One solution is to first split the intervals, so that they are sometimes equal but never partially overlap, and them remove the duplicates. The problem is that we are left with many small abutting intervals, and merging them does not look straightforward.

library(reshape2)
library(sqldf)
d$machine <- as.character( d$machine ) # Duplicated levels...
ddply( d, c("username", "machine"), function (u) {
  # For each username and machine, 
  # compute all the possible non-overlapping intervals
  intervals <- sort(unique( c(u$start, u$end) ))
  intervals <- data.frame( 
    start = intervals[-length(intervals)], 
    end   = intervals[-1] 
  )
  # Only retain those actually in the data
  u <- sqldf( "
    SELECT DISTINCT u.username, u.machine, 
                    intervals.start, intervals.end
    FROM  u, intervals 
    WHERE       u.start <= intervals.start 
    AND   intervals.end <=         u.end
  " )
  # We have non-overlapping, but potentially abutting intervals:
  # ideally, we should merge them, but I do not see an easy 
  # way to do so.
  u
} )

EDIT: Another, conceptually cleaner solution, that fixes the non-merged abutting intervals problem, is to count the number of open sessions for each user and machine: when it stops being zero, the user has logged in (with one or more session), when it drops to zero, the user has closed all his/her sessions.

ddply( d, c("username", "machine"), function (u) {
  a <- rbind( 
    data.frame( time = min(u$start) - 1, sessions = 0 ),
    data.frame( time = u$start, sessions = 1 ), 
    data.frame( time = u$end,   sessions = -1 ) 
  )
  a <- a[ order(a$time), ]
  a$sessions <- cumsum(a$sessions)
  a$previous <- c( 0, a$sessions[ - nrow(a) ] )
  a <- a[ a$previous == 0 & a$sessions  > 0 | 
          a$previous  > 0 & a$sessions == 0, ]
  a$previous_time <- a$time
  a$previous_time[-1] <- a$time[ -nrow(a) ]
  a <- a[ a$previous > 0 & a$sessions == 0, ]
  a <- data.frame( 
    username = u$username[1],
    machine  = u$machine[1],
    start = a$previous_time,
    end   = a$time
  )
  a
} )
like image 39
Vincent Zoonekynd Avatar answered Sep 30 '22 06:09

Vincent Zoonekynd