I have a datatable with many rows and I would like to conditionally group two columns, namely Begin and End. These columns stand for a certain month in which the associated person was doing something. Here is some sample data (you can use R to read in, or find the pure tables below if you dont use R):
# base:
test <- read.table(
text = "
1 A mnb USA prim 4 12
2 A mnb USA x 13 15
3 A mnb USA un 16 25
4 A mnb USA fdfds 1 2
5 B ghf CAN sdg 3 27
6 B ghf CAN hgh 28 29
7 B ghf CAN y 24 31
8 B ghf CAN ghf 38 42
",header=F)
library(data.table)
setDT(test)
names(test) <- c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1 A mnb USA fdfds 1 2
2 A mnb USA - 4 25
3 B ghf CAN - 3 31
4 B ghf CAN ghf 38 42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")
The grouping should be done as follows: If person A did hiking from month 4 to month 15 and travelling from month 16 to month 24, I would group the consecutive (i.e. without break) activity from month 4 to month 24. If afterwards person A did surfing from month 25 to month 28, I would also add this, and the whole group activity would last from 4 to 28. Now problematic are cases were there are overlapping periods, for example person A might also do fishing from 11 to 31, so the whole thing would become 4 to 31. However, if person A did something from 1 to 2, that would be a separate activity (as compared to 1 to 3, which would also have to be added, because 3 is connected to 4). I hope that was clear, if not you can find more examples in the above code. I am using datatable, because my dataset is quite large. I have started with sqldf so far, but it's problematic if you have so many activities per person (let's say 8 or more). Can this be done in datatable, or plyr, or sqldf? Please note: I am also looking for an answer in SQL because I could use that directly in sqldf or try to convert it to another language. sqldf supports (1) the SQLite backend database (by default), (2) the H2 java database, (3) the PostgreSQL database and (4) sqldf 0.4-0 onwards also supports MySQL.
Edit: Here are the 'pure' tables:
In:
Person Name Country add info Begin End
A mnb USA prim 4 12
A mnb USA x 13 15
A mnb USA un 16 25
A mnb USA fdfds 1 2
B ghf CAN sdg 3 27
B ghf CAN hgh 28 29
B ghf CAN y 24 31
B ghf CAN ghf 38 42
Out:
A mnb USA fdfds 1 2
A mnb USA - 4 25
B ghf CAN - 3 31
B ghf CAN ghf 38 42
I did this one which worked in my tests and almost all the main databases out there should normally run it... I underscored my columns... please, change the names before test:
SELECT
r1.person_,
r1.name_,
r1.country_,
CASE
WHEN max(r2.begin_) = max(r1.begin_)
THEN max(r1.info_) ELSE '-'
END info_,
MAX(r2.begin_) begin_,
r1.end_
FROM stack_39626781 r1
INNER JOIN stack_39626781 r2 ON 1=1
AND r2.person_ = r1.person_
AND r2.begin_ <= r1.begin_ -- just optimizing...
LEFT JOIN stack_39626781 r3 ON 1=1
AND r3.person_ = r1.person_
-- matches when another range overlaps this range end
AND r3.end_ >= r1.end_ + 1
AND r3.begin_ <= r1.end_ + 1
LEFT JOIN stack_39626781 r4 ON 1=1
AND r4.person_ = r2.person_
-- matches when another range overlaps this range begin
AND r4.end_ >= r2.begin_ - 1
AND r4.begin_ <= r2.begin_ - 1
WHERE 1=1
-- get rows
-- with no overlaps on end range and
-- with no overlaps on begin range
AND r3.person_ IS NULL
AND r4.person_ IS NULL
GROUP BY
r1.person_,
r1.name_,
r1.country_,
r1.end_
This query is based on the fact that any range from output have no connections/overlaps. Lets say that, for an output of five ranges, five begin
s and five end
s exists with no connections/overlaps. Find and associate them should be easier than generating all connections/overlaps. So, what this query does is:
end
value;begin
value;person
and end
, the correct begin
pair is the maximum one available which value is equal or lesser than this end
... it's easy to validate this rule... first, you can't have a begin
greater than an end
... also, if you have two or more possible begin
s lesser than end
, e. g., begin1 = end - 2 and begin2 = end - 5, selecting the lesser one (begin2) makes the greater one (begin1) an overlap of this range. Hope it helps.
If you are working with SQL Server 2012 or above, you can use the LAG and LEAD functions to build up your logic to arrive at your final desired dataset. Those functions have also been available in Oracle since Oracle 8i, I believe.
Below is a solution that I created in SQL Server 2012, which should help you. The example values you provided are loaded into a temporary table to represent what you referred to as your first "pure table." Using those two functions, along with OVER clause, I arrived at your final dataset with the following T-SQL code below. I left some of the commented out lines in the code to show how I was able to build up the overall solution piece-by-piece, which accounts for the various scenarios placed into the CASE statement for the GapMarker column that acts as grouping flag.
IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
DROP TABLE #MyTable
CREATE TABLE #MyTable (
Person CHAR(1)
,[Name] VARCHAR(3)
,Country VARCHAR(10)
,add_info VARCHAR(10)
,[Begin] INT
,[End] INT
)
INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
('A', 'mnb', 'USA', 'x', 13, 15),
('A', 'mnb', 'USA', 'un', 16, 25),
('A', 'mnb', 'USA', 'fdfds', 1, 2),
('B', 'ghf', 'CAN', 'sdg', 3, 27),
('B', 'ghf', 'CAN', 'hgh', 28, 29),
('B', 'ghf', 'CAN', 'y', 24, 31),
('B', 'ghf', 'CAN', 'ghf', 38, 42);
WITH CTE
AS
(SELECT
mt.Person
,mt.Name
,mt.Country
,mt.add_info
,mt.[Begin]
,mt.[End]
--,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
--,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- THEN 1
-- ELSE 0
-- END AS Grp
--,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
,CASE
WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) OR
1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin] OR
COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End] OR
[End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) THEN 1
ELSE 0
END AS GapMarker
,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
FROM #MyTable mt
--ORDER BY mt.Person, mt.[Begin]
)
SELECT DISTINCT
X.Person
,X.[Name]
,X.Country
,t.add_info
,X.MinBegin
,X.MaxEnd
FROM (SELECT
c.Person
,c.[Name]
,c.Country
,c.add_info
,c.[Begin]
,c.[End]
,c.GapMarker
,c.InBetween
,c.EndInBtw
,c.LagEndInBtw
,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
--, CASE WHEN c.[End]+1 = c.MARKER
-- OR c.MARKER +1 = c.[Begin]
-- THEN 1
-- ELSE 0
-- END Grp
FROM CTE AS c) X
LEFT JOIN #MyTable AS t
ON t.[Begin] = X.[MinBegin]
AND t.[End] = X.[MaxEnd]
AND t.Person = X.Person
ORDER BY X.Person, X.MinBegin
--ORDER BY Person, [Begin]
And here's a screen shot of the results that match your desired final dataset:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With