Edit: I'm looking for solution for this question now also with other programming languages.
Based on the other question I asked, I have a dataset like this (for R users, dput for this below) which represents user computer sessions:
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2 user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15 USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47
>
There may be several concurrent (overlapping based on time) sessions for the same username from the the same computer. How can I remove those rows so that only one sessions is left for this data? Original data set has approx. 500 000 rows.
The expected output is (rows 2, 15 removed)
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
>
Here is the dataset:
structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")
Following is complete algorithm. 1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.
Let's take the following overlapping intervals example to explain the idea: If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.
The basic idea is 1) first take input_start to test_start (if both of them are not equal and input_start is min) 2) always take test_start and test_end 3) take test_end to input_end if test_end is less than input end (and end_input and end_test are not equal).
Java Program to find overlapping intervals among a given set of intervals. In this approach, we are going to take a count array and for each given interval start time we will do count[startTime]++ and for end time, do count[endTime]--. sum > 2 it means there is overlapping intervals.
Try the intervals package:
library(intervals)
f <- function(dd) with(dd, {
r <- reduce(Intervals(cbind(start, end)))
data.frame(username = username[1],
machine = machine[1],
start = structure(r[, 1], class = class(start)),
end = structure(r[, 2], class = class(end)))
})
do.call("rbind", by(d, d[1:2], f))
With the sample data this reduces the 15 rows to the following 13 rows (by combining rows 1 and 2 and rows 12 and 13 in the original data frame):
username machine start end
1 user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2 user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3 user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4 user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5 user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6 user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7 user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8 user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9 USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10 USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11 USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12 USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13 USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47
One solution is to first split the intervals, so that they are sometimes equal but never partially overlap, and them remove the duplicates. The problem is that we are left with many small abutting intervals, and merging them does not look straightforward.
library(reshape2)
library(sqldf)
d$machine <- as.character( d$machine ) # Duplicated levels...
ddply( d, c("username", "machine"), function (u) {
# For each username and machine,
# compute all the possible non-overlapping intervals
intervals <- sort(unique( c(u$start, u$end) ))
intervals <- data.frame(
start = intervals[-length(intervals)],
end = intervals[-1]
)
# Only retain those actually in the data
u <- sqldf( "
SELECT DISTINCT u.username, u.machine,
intervals.start, intervals.end
FROM u, intervals
WHERE u.start <= intervals.start
AND intervals.end <= u.end
" )
# We have non-overlapping, but potentially abutting intervals:
# ideally, we should merge them, but I do not see an easy
# way to do so.
u
} )
EDIT: Another, conceptually cleaner solution, that fixes the non-merged abutting intervals problem, is to count the number of open sessions for each user and machine: when it stops being zero, the user has logged in (with one or more session), when it drops to zero, the user has closed all his/her sessions.
ddply( d, c("username", "machine"), function (u) {
a <- rbind(
data.frame( time = min(u$start) - 1, sessions = 0 ),
data.frame( time = u$start, sessions = 1 ),
data.frame( time = u$end, sessions = -1 )
)
a <- a[ order(a$time), ]
a$sessions <- cumsum(a$sessions)
a$previous <- c( 0, a$sessions[ - nrow(a) ] )
a <- a[ a$previous == 0 & a$sessions > 0 |
a$previous > 0 & a$sessions == 0, ]
a$previous_time <- a$time
a$previous_time[-1] <- a$time[ -nrow(a) ]
a <- a[ a$previous > 0 & a$sessions == 0, ]
a <- data.frame(
username = u$username[1],
machine = u$machine[1],
start = a$previous_time,
end = a$time
)
a
} )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With