Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group consecutive ranges

I have a datatable with many rows and I would like to conditionally group two columns, namely Begin and End. These columns stand for a certain month in which the associated person was doing something. Here is some sample data (you can use R to read in, or find the pure tables below if you dont use R):

# base:
test <- read.table(
text = "
1   A   mnb USA prim    4   12
2   A   mnb USA x   13  15
3   A   mnb USA un  16  25
4   A   mnb USA fdfds   1   2
5   B   ghf CAN sdg 3   27
6   B   ghf CAN hgh 28  29
7   B   ghf CAN y   24  31
8   B   ghf CAN ghf 38  42
",header=F)
library(data.table)
setDT(test)
names(test) <-  c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1   A   mnb USA fdfds   1   2
2   A   mnb USA -   4   25
3   B   ghf CAN -   3   31
4   B   ghf CAN ghf 38  42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")

The grouping should be done as follows: If person A did hiking from month 4 to month 15 and travelling from month 16 to month 24, I would group the consecutive (i.e. without break) activity from month 4 to month 24. If afterwards person A did surfing from month 25 to month 28, I would also add this, and the whole group activity would last from 4 to 28. Now problematic are cases were there are overlapping periods, for example person A might also do fishing from 11 to 31, so the whole thing would become 4 to 31. However, if person A did something from 1 to 2, that would be a separate activity (as compared to 1 to 3, which would also have to be added, because 3 is connected to 4). I hope that was clear, if not you can find more examples in the above code. I am using datatable, because my dataset is quite large. I have started with sqldf so far, but it's problematic if you have so many activities per person (let's say 8 or more). Can this be done in datatable, or plyr, or sqldf? Please note: I am also looking for an answer in SQL because I could use that directly in sqldf or try to convert it to another language. sqldf supports (1) the SQLite backend database (by default), (2) the H2 java database, (3) the PostgreSQL database and (4) sqldf 0.4-0 onwards also supports MySQL.

Edit: Here are the 'pure' tables:

In:

Person Name Country add info  Begin End
A      mnb  USA     prim      4      12
A      mnb  USA     x         13     15
A      mnb  USA     un        16     25
A      mnb  USA     fdfds     1      2
B      ghf  CAN     sdg       3      27
B      ghf  CAN     hgh       28     29
B      ghf  CAN     y         24     31
B      ghf  CAN     ghf       38     42

Out:

A      mnb  USA     fdfds     1      2
A      mnb  USA     -         4      25
B      ghf  CAN     -         3      31
B      ghf  CAN     ghf       38     42
like image 353
user3032689 Avatar asked Sep 21 '16 21:09

user3032689


2 Answers

I did this one which worked in my tests and almost all the main databases out there should normally run it... I underscored my columns... please, change the names before test:

SELECT 
  r1.person_,
  r1.name_,
  r1.country_,
  CASE  
    WHEN max(r2.begin_) = max(r1.begin_) 
    THEN max(r1.info_) ELSE '-' 
  END info_,
  MAX(r2.begin_) begin_, 
  r1.end_ 
FROM stack_39626781 r1
INNER JOIN stack_39626781 r2 ON 1=1
  AND r2.person_ = r1.person_
  AND r2.begin_ <= r1.begin_ -- just optimizing...
LEFT JOIN stack_39626781 r3 ON 1=1
  AND r3.person_ = r1.person_
  -- matches when another range overlaps this range end
  AND r3.end_   >= r1.end_ + 1
  AND r3.begin_ <= r1.end_ + 1
LEFT JOIN stack_39626781 r4 ON 1=1
  AND r4.person_ = r2.person_
  -- matches when another range overlaps this range begin
  AND r4.end_   >= r2.begin_ - 1
  AND r4.begin_ <= r2.begin_ - 1
WHERE 1=1
  -- get rows 
  -- with no overlaps on end range and
  -- with no overlaps on begin range
  AND r3.person_ IS NULL
  AND r4.person_ IS NULL
GROUP BY
  r1.person_,
  r1.name_,
  r1.country_,
  r1.end_ 

This query is based on the fact that any range from output have no connections/overlaps. Lets say that, for an output of five ranges, five begins and five ends exists with no connections/overlaps. Find and associate them should be easier than generating all connections/overlaps. So, what this query does is:

  1. Find all ranges per person with no overlaps/connections on its end value;
  2. Find all ranges per person with no overlaps/connections on its begin value;
  3. These are the valid ranges, so associate them all to find the correct pair;
  4. For each person and end, the correct begin pair is the maximum one available which value is equal or lesser than this end... it's easy to validate this rule... first, you can't have a begin greater than an end... also, if you have two or more possible begins lesser than end, e. g., begin1 = end - 2 and begin2 = end - 5, selecting the lesser one (begin2) makes the greater one (begin1) an overlap of this range.

Hope it helps.

like image 57
Felypp Oliveira Avatar answered Oct 14 '22 01:10

Felypp Oliveira


If you are working with SQL Server 2012 or above, you can use the LAG and LEAD functions to build up your logic to arrive at your final desired dataset. Those functions have also been available in Oracle since Oracle 8i, I believe.

Below is a solution that I created in SQL Server 2012, which should help you. The example values you provided are loaded into a temporary table to represent what you referred to as your first "pure table." Using those two functions, along with OVER clause, I arrived at your final dataset with the following T-SQL code below. I left some of the commented out lines in the code to show how I was able to build up the overall solution piece-by-piece, which accounts for the various scenarios placed into the CASE statement for the GapMarker column that acts as grouping flag.

IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
	DROP TABLE #MyTable

CREATE TABLE #MyTable (
	 Person CHAR(1)
	,[Name] VARCHAR(3)
	,Country VARCHAR(10)
	,add_info VARCHAR(10)
	,[Begin] INT
	,[End] INT
)

INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
	VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
	('A', 'mnb', 'USA', 'x', 13, 15),
	('A', 'mnb', 'USA', 'un', 16, 25),
	('A', 'mnb', 'USA', 'fdfds', 1, 2),
	('B', 'ghf', 'CAN', 'sdg', 3, 27),
	('B', 'ghf', 'CAN', 'hgh', 28, 29),
	('B', 'ghf', 'CAN', 'y', 24, 31),
	('B', 'ghf', 'CAN', 'ghf', 38, 42);

WITH CTE
AS
(SELECT
		mt.Person
		,mt.Name
		,mt.Country
		,mt.add_info
		,mt.[Begin]
		,mt.[End]
		--,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--      --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
		--	  THEN 1
		--	  ELSE 0
		--  END AS Grp
		--,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
		,CASE
			WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) OR
				1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin] OR
				COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End] OR
				[End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) THEN 1
			ELSE 0
		END AS GapMarker
		,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
		,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
		,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
	FROM #MyTable mt
--ORDER BY mt.Person, mt.[Begin]
)
SELECT DISTINCT
	X.Person
	,X.[Name]
	,X.Country
	,t.add_info
	,X.MinBegin
	,X.MaxEnd
FROM (SELECT
		c.Person
		,c.[Name]
		,c.Country
		,c.add_info
		,c.[Begin]
		,c.[End]
		,c.GapMarker
		,c.InBetween
		,c.EndInBtw
		,c.LagEndInBtw
		,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
		,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
	--, CASE WHEN c.[End]+1 = c.MARKER
	--        OR c.MARKER +1 = c.[Begin] 
	--  THEN 1
	--  ELSE 0
	--  END Grp
	FROM CTE AS c) X
LEFT JOIN #MyTable AS t
	ON t.[Begin] = X.[MinBegin]
		AND t.[End] = X.[MaxEnd]
		AND t.Person = X.Person
ORDER BY X.Person, X.MinBegin
--ORDER BY Person, [Begin]

And here's a screen shot of the results that match your desired final dataset:

enter image description here

like image 32
fausto.io Avatar answered Oct 14 '22 00:10

fausto.io