I need some help for a problem that i am struggling to solve.
Example table:
ID |Identifier1 | Identifier2
---------------------------------
1 | a | c
2 | b | f
3 | a | g
4 | c | h
5 | b | j
6 | d | f
7 | e | k
8 | i |
9 | l | h
I want to group identifiers that are related with each other between two columns and assign a unique group id.
Desired Output:
Identifier | Gr_ID | Gr.Members
---------------------------------------------------
a | 1 | (a,c,g,h,l)
b | 2 | (b,d,f,j)
c | 1 | (a,c,g,h,l)
d | 2 | (b,d,f,j)
e | 3 | (e,k)
f | 2 | (b,d,f,j)
g | 1 | (a,c,g,h,l)
h | 1 | (a,c,g,h,l)
j | 2 | (b,d,f,j)
k | 3 | (e,k)
l | 1 | (a,c,g,h,l)
i | 4 | (i)
Note:the column Gr.Members is not necessary, mostly is used for a clearer view.
So the definition for a group is: A row belongs to a group if it shares at least one identifier with at least one row of this group
But the group id has to be assigned to each identifier(selected by the union of the two columns) not to the row.
Any help on how to build a query to give the desired output?
Thank you.
Update: Below are some extra sample sets with their expected output.
Given table:
Identifier1 | Identifier2
----------------------------
a | f
a | g
a | NULL
b | c
b | a
b | h
b | j
b | NULL
b | NULL
b | g
c | k
c | b
d | l
d | f
d | g
d | m
d | a
d | NULL
d | a
e | c
e | b
e | NULL
Expected output: all the records should belong to the same group with group ID = 1.
Given Table:
Identifier1 | Identifier2
--------------------------
a | a
b | b
c | a
c | b
c | c
Expected output: The records should be in the same group with group ID = 1.
In a connected graph, there is exactly one component: the whole graph.
A subgraph G′ = (V′, E′) of G is a graph with V′ ⊆ V and E ⊆ E1, where E1 is a subset of E, whose edges connect vertices that lie in V′. Clearly, G is a subgraph of itself. A subgraph G′ = (V′, E′) is connected if there exists at least one path connecting any pair of vertices in V′ (Figure 13.5c).
First, mark all vertices as unvisited. Iterate over all vertices. If a vertex is not visited, perform DFS on that vertex and increment the count by 1. After iterating over all vertices, the value of count will be the number of connected components in the graph.
Here is a variant that doesn't use cursor, but uses a single recursive query.
Essentially, it treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.
See the detailed explanations of how it works below. I recommend you to run the query CTE-by-CTE and examine each intermediate result to understand what it does.
Sample 1
DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
INSERT INTO @T (ID, Ident1, Ident2) VALUES
(1, 'a', 'a'),
(2, 'b', 'b'),
(3, 'c', 'a'),
(4, 'c', 'b'),
(5, 'c', 'c');
Sample 2
I added one more row with z
value to have multiple rows with unpaired values.
DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
INSERT INTO @T (ID, Ident1, Ident2) VALUES
(1, 'a', 'a'),
(1, 'a', 'c'),
(2, 'b', 'f'),
(3, 'a', 'g'),
(4, 'c', 'h'),
(5, 'b', 'j'),
(6, 'd', 'f'),
(7, 'e', 'k'),
(8, 'i', NULL),
(88, 'z', 'z'),
(9, 'l', 'h');
Sample 3
DECLARE @T TABLE (ID int, Ident1 char(1), Ident2 char(1));
INSERT INTO @T (ID, Ident1, Ident2) VALUES
(1, 'a', 'f'),
(2, 'a', 'g'),
(3, 'a', NULL),
(4, 'b', 'c'),
(5, 'b', 'a'),
(6, 'b', 'h'),
(7, 'b', 'j'),
(8, 'b', NULL),
(9, 'b', NULL),
(10, 'b', 'g'),
(11, 'c', 'k'),
(12, 'c', 'b'),
(13, 'd', 'l'),
(14, 'd', 'f'),
(15, 'd', 'g'),
(16, 'd', 'm'),
(17, 'd', 'a'),
(18, 'd', NULL),
(19, 'd', 'a'),
(20, 'e', 'c'),
(21, 'e', 'b'),
(22, 'e', NULL);
Query
WITH
CTE_Idents
AS
(
SELECT Ident1 AS Ident
FROM @T
UNION
SELECT Ident2 AS Ident
FROM @T
)
,CTE_Pairs
AS
(
SELECT Ident1, Ident2
FROM @T
WHERE Ident1 <> Ident2
UNION
SELECT Ident2 AS Ident1, Ident1 AS Ident2
FROM @T
WHERE Ident1 <> Ident2
)
,CTE_Recursive
AS
(
SELECT
CAST(CTE_Idents.Ident AS varchar(8000)) AS AnchorIdent
, Ident1
, Ident2
, CAST(',' + Ident1 + ',' + Ident2 + ',' AS varchar(8000)) AS IdentPath
, 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1
UNION ALL
SELECT
CTE_Recursive.AnchorIdent
, CTE_Pairs.Ident1
, CTE_Pairs.Ident2
, CAST(CTE_Recursive.IdentPath + CTE_Pairs.Ident2 + ',' AS varchar(8000)) AS IdentPath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1
WHERE
CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
)
,CTE_RecursionResult
AS
(
SELECT AnchorIdent, Ident1, Ident2
FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
SELECT AnchorIdent, Ident1 AS Ident
FROM CTE_RecursionResult
UNION
SELECT AnchorIdent, Ident2 AS Ident
FROM CTE_RecursionResult
)
SELECT
CTE_Idents.Ident
,CASE WHEN CA_Data.XML_Value IS NULL
THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END AS GroupMembers
,DENSE_RANK() OVER(ORDER BY
CASE WHEN CA_Data.XML_Value IS NULL
THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END
) AS GroupID
FROM
CTE_Idents
CROSS APPLY
(
SELECT CTE_CleanResult.Ident+','
FROM CTE_CleanResult
WHERE CTE_CleanResult.AnchorIdent = CTE_Idents.Ident
ORDER BY CTE_CleanResult.Ident FOR XML PATH(''), TYPE
) AS CA_XML(XML_Value)
CROSS APPLY
(
SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
) AS CA_Data(XML_Value)
WHERE
CTE_Idents.Ident IS NOT NULL
ORDER BY Ident;
Result 1
+-------+--------------+---------+
| Ident | GroupMembers | GroupID |
+-------+--------------+---------+
| a | a,b,c, | 1 |
| b | a,b,c, | 1 |
| c | a,b,c, | 1 |
+-------+--------------+---------+
Result 2
+-------+--------------+---------+
| Ident | GroupMembers | GroupID |
+-------+--------------+---------+
| a | a,c,g,h,l, | 1 |
| b | b,d,f,j, | 2 |
| c | a,c,g,h,l, | 1 |
| d | b,d,f,j, | 2 |
| e | e,k, | 3 |
| f | b,d,f,j, | 2 |
| g | a,c,g,h,l, | 1 |
| h | a,c,g,h,l, | 1 |
| i | i | 4 |
| j | b,d,f,j, | 2 |
| k | e,k, | 3 |
| l | a,c,g,h,l, | 1 |
| z | z | 5 |
+-------+--------------+---------+
Result 3
+-------+--------------------------+---------+
| Ident | GroupMembers | GroupID |
+-------+--------------------------+---------+
| a | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| b | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| c | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| d | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| e | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| f | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| g | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| h | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| j | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| k | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| l | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
| m | a,b,c,d,e,f,g,h,j,k,l,m, | 1 |
+-------+--------------------------+---------+
I'll use the second set of sample data for this explanation.
CTE_Idents
CTE_Idents
gives the list of all Identifiers that appear in both Ident1
and Ident2
columns.
Since they can appear in any order we UNION
both columns together. UNION
also removes any duplicates.
+-------+
| Ident |
+-------+
| NULL |
| a |
| b |
| c |
| d |
| e |
| f |
| g |
| h |
| i |
| j |
| k |
| l |
| z |
+-------+
CTE_Pairs
CTE_Pairs
gives the list of all edges of the graph in both directions. Again, UNION
is used to remove any duplicates.
+--------+--------+
| Ident1 | Ident2 |
+--------+--------+
| a | c |
| a | g |
| b | f |
| b | j |
| c | a |
| c | h |
| d | f |
| e | k |
| f | b |
| f | d |
| g | a |
| h | c |
| h | l |
| j | b |
| k | e |
| l | h |
+--------+--------+
CTE_Recursive
CTE_Recursive
is the main part of the query that recursively traverses the graph starting from each unique Identifier.
These starting rows are produced by the first part of UNION ALL
.
The second part of UNION ALL
recursively joins to itself linking Ident2
to Ident1
.
Since we pre-made CTE_Pairs
with all edges written in both directions, we can always link only Ident2
to Ident1
and we'll get all paths in the graph.
At the same time the query builds IdentPath
- a string of comma-delimited Identifiers that have been traversed so far.
It is used in the WHERE
filter:
CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
As soon as we come across the Identifier that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted.
AnchorIdent
is the starting Identifier for the recursion, it will be used later to group results.
Lvl
is not really used, I included it for better understanding of what is going on.
+-------------+--------+--------+-------------+-----+
| AnchorIdent | Ident1 | Ident2 | IdentPath | Lvl |
+-------------+--------+--------+-------------+-----+
| a | a | c | ,a,c, | 1 |
| a | a | g | ,a,g, | 1 |
| b | b | f | ,b,f, | 1 |
| b | b | j | ,b,j, | 1 |
| c | c | a | ,c,a, | 1 |
| c | c | h | ,c,h, | 1 |
| d | d | f | ,d,f, | 1 |
| e | e | k | ,e,k, | 1 |
| f | f | b | ,f,b, | 1 |
| f | f | d | ,f,d, | 1 |
| g | g | a | ,g,a, | 1 |
| h | h | c | ,h,c, | 1 |
| h | h | l | ,h,l, | 1 |
| j | j | b | ,j,b, | 1 |
| k | k | e | ,k,e, | 1 |
| l | l | h | ,l,h, | 1 |
| l | h | c | ,l,h,c, | 2 |
| l | c | a | ,l,h,c,a, | 3 |
| l | a | g | ,l,h,c,a,g, | 4 |
| j | b | f | ,j,b,f, | 2 |
| j | f | d | ,j,b,f,d, | 3 |
| h | c | a | ,h,c,a, | 2 |
| h | a | g | ,h,c,a,g, | 3 |
| g | a | c | ,g,a,c, | 2 |
| g | c | h | ,g,a,c,h, | 3 |
| g | h | l | ,g,a,c,h,l, | 4 |
| f | b | j | ,f,b,j, | 2 |
| d | f | b | ,d,f,b, | 2 |
| d | b | j | ,d,f,b,j, | 3 |
| c | h | l | ,c,h,l, | 2 |
| c | a | g | ,c,a,g, | 2 |
| b | f | d | ,b,f,d, | 2 |
| a | c | h | ,a,c,h, | 2 |
| a | h | l | ,a,c,h,l, | 3 |
+-------------+--------+--------+-------------+-----+
CTE_CleanResult
CTE_CleanResult
leaves only relevant parts from CTE_Recursive
and again merges both Ident1
and Ident2
using UNION
.
+-------------+-------+
| AnchorIdent | Ident |
+-------------+-------+
| a | a |
| a | c |
| a | g |
| a | h |
| a | l |
| b | b |
| b | d |
| b | f |
| b | j |
| c | a |
| c | c |
| c | g |
| c | h |
| c | l |
| d | b |
| d | d |
| d | f |
| d | j |
| e | e |
| e | k |
| f | b |
| f | d |
| f | f |
| f | j |
| g | a |
| g | c |
| g | g |
| g | h |
| g | l |
| h | a |
| h | c |
| h | g |
| h | h |
| h | l |
| j | b |
| j | d |
| j | f |
| j | j |
| k | e |
| k | k |
| l | a |
| l | c |
| l | g |
| l | h |
| l | l |
+-------------+-------+
Final SELECT
Now we need to build a string of comma-separated Ident
values for each AnchorIdent
.
CROSS APPLY
with FOR XML
does it.
DENSE_RANK()
calculates the GroupID
numbers for each AnchorIdent
.
My suggestion is to use stored procedure with cursor. It is easy to implement and relatively fast. Only two steps:
Query:
CREATE TABLE #PairIds
(
Ident1 VARCHAR(10),
Ident2 VARCHAR(10)
)
INSERT INTO #PairIds
VALUES ('a', 'c'),
('b', 'f'),
('a', 'g'),
('c', 'h'),
('b', 'j'),
('d', 'f'),
('e', 'k'),
('l', 'h')
exec [dbo].[sp_GetIdentByGroup]
Result:
Ident | GroupID
---------------------------------------------------
a | 1 |
b | 2 |
c | 1 |
d | 2 |
e | 3 |
f | 2 |
g | 1 |
h | 1 |
j | 2 |
k | 3 |
l | 1 |
Code for creating stored procedure:
CREATE PROCEDURE [dbo].[sp_GetIdentByGroup]
AS
BEGIN
DECLARE @message VARCHAR(70);
DECLARE @IdentInput1 varchar(20)
DECLARE @IdentInput2 varchar(20)
DECLARE @Counter INT
DECLARE @Group1 INT
DECLARE @Group2 INT
DECLARE @Ident varchar(20)
DECLARE @IdentCheck1 varchar(20)
DECLARE @IdentCheck2 varchar(20)
SET @Counter = 1
DECLARE @IdentByGroupCursor TABLE (
Ident varchar(20) UNIQUE CLUSTERED,
GroupID INT
);
-- Use a cursor to select your data, which enables SQL Server to extract
-- the data from your local table to the variables.
declare ins_cursor cursor for
select Ident1, Ident2 from #PairIds
open ins_cursor
fetch next from ins_cursor into @IdentInput1, @IdentInput2 -- At this point, the data from the first row
-- is in your local variables.
-- Move through the table with the @@FETCH_STATUS=0
WHILE @@FETCH_STATUS=0
BEGIN
SET @Group1 = null
SET @Group2 = null
SELECT TOP 1 @Group1 = GroupID, @IdentCheck1 = Ident
FROM @IdentByGroupCursor
WHERE Ident in (@IdentInput1)
SELECT TOP 1 @Group2 = GroupID, @IdentCheck2 = Ident
FROM @IdentByGroupCursor
WHERE Ident in (@IdentInput2)
IF (@Group1 IS NOT NULL AND @Group2 IS NOT NULL)
BEGIN
IF @Group1 > @Group2
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group2
WHERE
GroupID = @Group1
END
IF @Group2 > @Group1
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group1
WHERE
GroupID = @Group2
END
END
ELSE IF @Group1 IS NOT NULL
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group1
WHERE
Ident IN (@IdentInput1)
END
ELSE IF @Group2 IS NOT NULL
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group2
WHERE
Ident IN (@IdentInput2)
END
IF (@Group1 IS NOT NULL AND @Group2 IS NOT NULL)
BEGIN
IF @Group1 > @Group2
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group2
WHERE
GroupID = @Group1
END
IF @Group2 > @Group1
BEGIN
UPDATE @IdentByGroupCursor
SET GroupID = @Group1
WHERE
GroupID = @Group2
END
END
IF @Group1 IS NULL
BEGIN
INSERT INTO @IdentByGroupCursor (Ident, GroupID)
VALUES (@IdentInput1, ISNULL(@Group2, @Counter))
END
IF @Group2 IS NULL
BEGIN
INSERT INTO @IdentByGroupCursor (Ident, GroupID)
VALUES (@IdentInput2, ISNULL(@Group1, @COunter))
END
IF (@Group1 IS NULL OR @Group2 IS NULL)
BEGIN
SET @COunter = @COunter +1
END
-- Once the execution has taken place, you fetch the next row of data from your local table.
fetch next from ins_cursor into @IdentInput1, @IdentInput2
End
-- When all the rows have inserted you must close and deallocate the cursor.
-- Failure to do this will not let you re-use the cursor.
close ins_cursor
deallocate ins_cursor
SELECT Ident ,DENSE_RANK() OVER( ORDER BY GroupID ASC) AS GroupID
FROM @IdentByGroupCursor
ORDER BY Ident
END
GO
Sp_GetIdentByGroup has an index for speed and with the use of a cursor, it prepares desired result set. The stored procedure expects #PairIds table to exist.
More information on SQL How to group identifiers that are related with each other in specific groups.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With