I have multiple records with sparsely populated identifiers (I will call these ID Numbers). I can have a maximum of two different ID Numbers per record and want to be able to traverse all the related records together so that I can create a single shared identifier. I want to achieve this in a T-SQL query.
Essentially, here is some sample data:
+-------+-------+--------+-----+------+
| RowId | ID1 | ID2 | ID3 | ID4 |
+-------+-------+--------+-----+------+
| 1 | 11111 | | | |
| 2 | 11111 | | | |
| 3 | 11111 | AAAAA | | |
| 4 | | BBBBBB | BC1 | |
| 5 | | | BC1 | O111 |
| 6 | | GGGGG | BC1 | |
| 7 | | AAAAA | | O111 |
| 8 | | CCCCCC | | |
| 9 | 99999 | | | |
| 10 | 99999 | DDDDDD | | |
| 11 | | | | O222 |
| 12 | | EEEEEE | | O222 |
| 13 | | EEEEEE | | O333 |
+-------+-------+--------+-----+------+
So for example, 11111 is linked to AAAAA in RowId3, and AAAAA is also linked to O111 in rowId 7. O111 is linked to BC1 in RowId 5. BC1 is linked to BBBBBB in RowId 4, etc. Also, I want to create a new single identifier once all of these rows are linked.
Here is the output I want to achieve for all of the data above:
Denormalised:
+---------+-------+--------+-----+------+
| GroupId | ID1 | ID2 | ID3 | ID4 |
+---------+-------+--------+-----+------+
| 1 | 11111 | AAAAA | BC1 | O111 |
| 1 | 11111 | BBBBBB | BC1 | O111 |
| 1 | 11111 | GGGGG | BC1 | O111 |
| 2 | | CCCCCC | | |
| 3 | 99999 | DDDDDD | | |
| 4 | | EEEEEE | | O222 |
| 4 | | EEEEEE | | O333 |
+---------+-------+--------+-----+------+
Normalized (probably better to work with):
+--------+----------+---------+
| IDType | IDNumber | GroupId |
+--------+----------+---------+
| ID1 | 11111 | 1 |
| ID2 | AAAAA | 1 |
| ID2 | BBBBBB | 1 |
| ID2 | GGGGG | 1 |
| ID3 | BC1 | 1 |
| ID4 | O111 | 1 |
| ID2 | CCCCCC | 2 |
| ID1 | 99999 | 3 |
| ID2 | DDDDDD | 3 |
| ID2 | EEEEEE | 4 |
| ID4 | O222 | 4 |
| ID4 | O333 | 4 |
+--------+----------+---------+
I am looking for SQL code to generate the output above or similar normalized structure. Thanks.
EDIT: Here is some code to create data that matches the sample data in the table above.
DROP TABLE IF EXISTS #ID
CREATE TABLE #ID
(
RowId INT,
ID1 VARCHAR(100),
ID2 VARCHAR(100),
ID3 VARCHAR(100),
ID4 VARCHAR(100)
)
INSERT INTO #ID VALUES
(1,'11111',NULL,NULL,NULL),
(2,'11111',NULL,NULL,NULL),
(3,'11111','AAAAA',NULL,NULL),
(4,NULL,'BBBBBB','BC1',NULL),
(5,NULL,NULL,'BC1','O111'),
(6,NULL,'GGGGG','BC1',NULL),
(7,NULL,'AAAAA',NULL,'O111'),
(8,NULL,'CCCCCC',NULL,NULL),
(9,'99999',NULL,NULL,NULL),
(10,'99999','DDDDDD',NULL,NULL),
(11,NULL,NULL,NULL,'O222'),
(12,NULL,'EEEEEE',NULL,'O222'),
(13,NULL,'EEEEEE',NULL,'O333')
It is easy to get your normalized output.
I'm using my query from How to find all connected subgraphs of an undirected graph with minor modification to convert your data into pairs that define edges of a graph. The query treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.
Your source table has four IDs, but each row can have only two IDs, so we know that each row has a pair of IDs. My query expects this kind of data (pairs of IDs). It is easy to convert four IDs into a pair - use COALESCE
.
For detailed explanation of how it works, see How to find all connected subgraphs of an undirected graph.
Query
WITH
CTE_Idents
AS
(
SELECT ID1 AS Ident, 'ID1' AS IDType
FROM @T
UNION
SELECT ID2 AS Ident, 'ID2' AS IDType
FROM @T
UNION
SELECT ID3 AS Ident, 'ID3' AS IDType
FROM @T
UNION
SELECT ID4 AS Ident, 'ID4' AS IDType
FROM @T
)
,CTE_Pairs
AS
(
SELECT COALESCE(ID1, ID2, ID3, ID4) AS Ident1, COALESCE(ID4, ID3, ID2, ID1) AS Ident2
FROM @T
UNION
SELECT COALESCE(ID4, ID3, ID2, ID1) AS Ident1, COALESCE(ID1, ID2, ID3, ID4) AS Ident2
FROM @T
)
,CTE_Recursive
AS
(
SELECT
CAST(CTE_Idents.Ident AS varchar(8000)) AS AnchorIdent
, Ident1
, Ident2
, CAST(',' + Ident1 + ',' + Ident2 + ',' AS varchar(8000)) AS IdentPath
, 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1
UNION ALL
SELECT
CTE_Recursive.AnchorIdent
, CTE_Pairs.Ident1
, CTE_Pairs.Ident2
, CAST(CTE_Recursive.IdentPath + CTE_Pairs.Ident2 + ',' AS varchar(8000)) AS IdentPath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1
WHERE
CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
)
,CTE_RecursionResult
AS
(
SELECT AnchorIdent, Ident1, Ident2
FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
SELECT AnchorIdent, Ident1 AS Ident
FROM CTE_RecursionResult
UNION
SELECT AnchorIdent, Ident2 AS Ident
FROM CTE_RecursionResult
)
SELECT
CTE_Idents.IDType
,CTE_Idents.Ident
,CASE WHEN CA_Data.XML_Value IS NULL
THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END AS GroupMembers
,DENSE_RANK() OVER(ORDER BY
CASE WHEN CA_Data.XML_Value IS NULL
THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END
) AS GroupID
FROM
CTE_Idents
CROSS APPLY
(
SELECT CTE_CleanResult.Ident+','
FROM CTE_CleanResult
WHERE CTE_CleanResult.AnchorIdent = CTE_Idents.Ident
ORDER BY CTE_CleanResult.Ident FOR XML PATH(''), TYPE
) AS CA_XML(XML_Value)
CROSS APPLY
(
SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
) AS CA_Data(XML_Value)
WHERE
CTE_Idents.Ident IS NOT NULL
ORDER BY GroupID, IDType, Ident;
Result
+--------+--------+------------------------------------+---------+
| IDType | Ident | GroupMembers | GroupID |
+--------+--------+------------------------------------+---------+
| ID1 | 11111 | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID2 | AAAAA | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID2 | BBBBBB | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID2 | GGGGG | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID3 | BC1 | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID4 | O111 | 11111,AAAAA,BBBBBB,BC1,GGGGG,O111, | 1 |
| ID1 | 99999 | 99999,DDDDDD, | 2 |
| ID2 | DDDDDD | 99999,DDDDDD, | 2 |
| ID2 | CCCCCC | CCCCCC, | 3 |
| ID2 | EEEEEE | EEEEEE,O222,O333, | 4 |
| ID4 | O222 | EEEEEE,O222,O333, | 4 |
| ID4 | O333 | EEEEEE,O222,O333, | 4 |
+--------+--------+------------------------------------+---------+
This is how your data looks like as a graph:
I rendered this image using DOT from https://www.graphviz.org/.
How to convert this nomalized output into denormalized? One way is to unpivot it using the help of IDType
, though it might get tricky if the graph can have several loops. You'd better ask another question specifically about converting nomalized dataset into denormalized.
Well, this was a real brain twister ;-) and my solution is just close... Try this:
General remarks:
The query
WITH cte AS
(
SELECT RowId
,A.ID
,A.sourceId
,ROW_NUMBER() OVER(PARTITION BY RowId ORDER BY A.SourceId) AS IdCounter
FROM #ID
CROSS APPLY (VALUES('ID1',ID1),('ID2',ID2),('ID3',ID3),('ID4',ID4)) A(sourceId,ID)
WHERE A.ID IS NOT NULL
)
,AllIDs AS
(
SELECT RowId
,MAX(CASE WHEN IdCounter=1 THEN ID END) AS FirstId
,MAX(CASE WHEN IdCounter=1 THEN sourceId END) AS FirstSource
,MAX(CASE WHEN IdCounter=2 THEN ID END) AS SecondId
,MAX(CASE WHEN IdCounter=2 THEN sourceId END) AS SecondSource
FROM cte
GROUP BY RowId
)
,recCTE AS
(
SELECT RowId
,FirstId
,FirstSource
,SecondId
,SecondSource
,CAST(N'|' + FirstId AS NVARCHAR(MAX)) AS RunningPath
FROM AllIDs WHERE SecondId IS NULL
UNION ALL
SELECT ai.RowId
,ai.FirstId
,ai.FirstSource
,ai.SecondId
,ai.SecondSource
,r.RunningPath + CAST(N'|' + ai.FirstId AS NVARCHAR(MAX))
FROM AllIDs ai
INNER JOIN recCTE r ON ai.RowId<>r.RowId AND (ai.FirstId=r.FirstId OR ai.FirstId=r.SecondId OR ai.SecondId=r.FirstId OR ai.SecondId=r.SecondId )
WHERE r.RunningPath NOT LIKE CONCAT('%|',ai.FirstId,'|%')
)
,FindIslands AS
(
SELECT FirstId
,FirstSource
,SecondId
,SecondSource
,CONCAT(CanonicalPath,'|') AS CanonicalPath
FROM recCTE
CROSS APPLY(SELECT CAST('<x>' + REPLACE(CONCAT(RunningPath,'|',SecondId),'|','</x><x>') + '</x>' AS XML)) A(Casted)
CROSS APPLY(SELECT Casted.query('
for $x in distinct-values(/x[text()])
order by $x
return <x>{concat("|",$x)}</x>
').value('.','nvarchar(max)')) B(CanonicalPath)
)
,MaxPaths AS
(
SELECT fi.CanonicalPath
,x.CanonicalPath AS BestPath
,LEN(x.CanonicalPath) AS PathLength
,ROW_NUMBER() OVER(PARTITION BY fi.CanonicalPath ORDER BY LEN(x.CanonicalPath) DESC) AS SortIndex
FROM FindIslands fi
INNER JOIN FindIslands x ON LEN(x.CanonicalPath)>=LEN(fi.CanonicalPath) AND x.CanonicalPath LIKE CONCAT('%',fi.CanonicalPath,'%' )
--GROUP BY fi.CanonicalPath
)
,AlmostCorrect AS
(
SELECT *
FROM
(
SELECT mp.BestPath,fi.FirstId AS ID,FirstSource AS IDSource
FROM FindIslands fi
INNER JOIN MaxPaths mp On mp.SortIndex=1 AND fi.CanonicalPath=mp.CanonicalPath
UNION ALL
SELECT mp.BestPath,fi.SecondId,SecondSource
FROM FindIslands fi
INNER JOIN MaxPaths mp On mp.SortIndex=1 AND fi.CanonicalPath=mp.CanonicalPath
) t
WHERE ID IS NOT NULL
GROUP BY BestPath,ID,IDSource
)
SELECT * FROm AlmostCorrect;
The result
+--------------------------------+--------+----------+
| BestPath | ID | IDSource |
+--------------------------------+--------+----------+
| |11111|AAAAA|BBBBBB|BC1|GGGGG| | 11111 | ID1 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BBBBBB|BC1|GGGGG| | AAAAA | ID2 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BBBBBB|BC1|GGGGG| | BBBBBB | ID2 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BBBBBB|BC1|GGGGG| | BC1 | ID3 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BBBBBB|BC1|GGGGG| | GGGGG | ID2 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BC1|GGGGG| | BC1 | ID3 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BC1|GGGGG| | GGGGG | ID2 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BC1|O111| | BC1 | ID3 |
+--------------------------------+--------+----------+
| |11111|AAAAA|BC1|O111| | O111 | ID4 |
+--------------------------------+--------+----------+
| |11111|AAAAA|O111| | AAAAA | ID2 |
+--------------------------------+--------+----------+
| |11111|AAAAA|O111| | O111 | ID4 |
+--------------------------------+--------+----------+
| |99999|DDDDDD| | 99999 | ID1 |
+--------------------------------+--------+----------+
| |99999|DDDDDD| | DDDDDD | ID2 |
+--------------------------------+--------+----------+
| |CCCCCC| | CCCCCC | ID2 |
+--------------------------------+--------+----------+
| |EEEEEE|O222|O333| | EEEEEE | ID2 |
+--------------------------------+--------+----------+
| |EEEEEE|O222|O333| | O222 | ID4 |
+--------------------------------+--------+----------+
| |EEEEEE|O222|O333| | O333 | ID4 |
+--------------------------------+--------+----------+
The idea behind:
You can see the result of each intermediate step simply by using SELECT * FROM [cte-name]
as last select (out-comment the current last select).
What we have achieved:
What we did not yet achieve:
|11111|AAAAA|BBBBBB|BC1|GGGGG|
, |11111|AAAAA|BC1|GGGGG|
, |11111|AAAAA|BC1|O111|
, |11111|AAAAA|O111|
are treated as different, although their fragments are overlapping.At the moment I'm to tired to think about this... Might be a get an idea tomorrow ;-)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With