Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to combine two rows and calculate the time difference between two timestamp values in MySQL?

Tags:

sql

mysql

I have a situation that I'm sure is quite common and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution. I'm relatively new to MySQL (have been using MSSQL and PostgreSQL earlier) and every approach I can think of is blocked by some feature lacking in MySQL.

I have a "log" table that simply lists many different events with their timestamp (stored as datetime type). There's lots of data and columns in the table not relevant to this problem, so lets say we have a simple table like this:

CREATE TABLE log (  
  id INT NOT NULL AUTO_INCREMENT,  
  name VARCHAR(16),  
  ts DATETIME NOT NULL,  
  eventtype VARCHAR(25),  
  PRIMARY KEY  (id)  
)

Let's say that some rows have an eventtype = 'start' and others have an eventtype = 'stop'. What I want to do is to somehow couple each "startrow" with each "stoprow" and find the time difference between the two (and then sum the durations per each name, but that's not where the problem lies). Each "start" event should have a corresponding "stop" event occuring at some stage later then the "start" event, but because of problems/bugs/crashed with the data collector it could be that some are missing. In that case I would like to disregard the event without a "partner". That means that given the data:

foo, 2010-06-10 19:45, start  
foo, 2010-06-10 19:47, start  
foo, 2010-06-10 20:13, stop

..I would like to just disregard the 19:45 start event and not just get two result rows both using the 20:13 stop event as the stop time.

I've tried to join the table with itself in different ways, but the key problems for me seems to be to find a way to correctly identify the corresponding "stop" event to the "start" event for the given "name". The problem is exactly the same as you would have if you had table with employees stamping in and out of work and wanted to find out how much they actually were at work. I'm sure there must be well known solutions to this, but I can't seem to find them...

like image 580
Nadar Avatar asked Jun 10 '10 18:06

Nadar


3 Answers

If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:

*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

Instead, just drop/create a normal table, as follows:

DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
    `id` INT NOT NULL,
    `row` INT NOT NULL,
    `name` VARCHAR(16),
    `ts` DATETIME NOT NULL,
    `eventtype` VARCHAR(25),
    INDEX `row` (`row` ASC),
    INDEX `eventtype` (`eventtype` ASC)
);

This table is used to store a sorted and numbered list of rows from the following SELECT query:

INSERT INTO `tmp_log` (
    `id`,
    `row`,
    `name`,
    `ts`,
    `eventtype`
)
SELECT
    `id`,
    @row:=@row+1,
    `name`,
    `ts`,
    `eventtype`
FROM log,
(SELECT @row:=0) row_count
ORDER BY `name`, `id`;

The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, just so long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row id of the stop event.

Now select the matching pairs from the list:

SELECT
    start_log.row AS start_row,
    stop_log.row AS stop_row,
    start_log.name AS name,
    start_log.eventtype AS start_event,
    start_log.ts AS start_time,
    stop_log.eventtype AS stop_event,
    stop_log.ts AS end_time,
    TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
    tmp_log AS start_log
INNER JOIN tmp_log AS stop_log
    ON start_log.row+1 = stop_log.row
    AND start_log.name = stop_log.name
    AND start_log.eventtype = 'start'
    AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;

Once you're done, it's probably a good idea to drop the temporary table:

DROP TABLE IF EXISTS `tmp_log`;row

UPDATE

You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.

SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
    SELECT
        id, ts, eventtype,
        (@name <> name) AS new_name,
        @start AS start,
        @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
        @stop  := IF(eventtype = 'stop',  ts, NULL) AS stop,
        @name  := name AS name
    FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;

I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.

Here's how I created the test data:

CREATE TABLE  `table1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(5),
    `ts` DATETIME,
    `eventtype` VARCHAR(5),
    PRIMARY KEY (`id`),
    INDEX `name` (`name`),
    INDEX `ts` (`ts`)
) ENGINE=InnoDB;

DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 1000000 DO
        INSERT INTO table1 (name, ts, eventtype) VALUES (
            CHAR(FLOOR(65 + RAND() * 26)),
            DATE_ADD(NOW(),
            INTERVAL FLOOR(RAND() * 365) DAY),
            IF(RAND() >= 0.5, 'start', 'stop')
        );
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;

CALL autofill();
like image 76
Mike Avatar answered Sep 22 '22 10:09

Mike


I believe this could be a simpler way to reach your goal:

SELECT
    start_log.name,
    MAX(start_log.ts) AS start_time,
    end_log.ts AS end_time,
    TIMEDIFF(MAX(start_log.ts), end_log.ts)
FROM
    log AS start_log
INNER JOIN
    log AS end_log ON (
            start_log.name = end_log.name
        AND
            end_log.ts > start_log.ts)
WHERE start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
GROUP BY start_log.name

It should run considerably faster as it eliminates one subquery.

like image 23
Ivar Bonsaksen Avatar answered Sep 19 '22 10:09

Ivar Bonsaksen


Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do

SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
FROM `log` S
    JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'end'
WHERE S.eventtype = 'start'
like image 41
andrem Avatar answered Sep 23 '22 10:09

andrem