My table looks as follows:
author | group
daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...
I want my output to look like:
author1 | author2 | intersection | union
daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam | harry| 0 | 10
THANK YOU
Try below (for BigQuery)
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.item=b.item) AS intersection,
EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author
GROUP BY 1,2
Added solution for BigQuery Standard SQL
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, SPLIT(grp) AS grp
FROM YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
(SELECT COUNT(1) FROM a.grp) AS count1,
(SELECT COUNT(1) FROM b.grp) AS count2,
(SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
(SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
What I like about this one:
When/If try - make sure to uncheck Use Legacy SQL
checkbox under Show Options
I propose this option that scales better:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, grp
FROM YourTable, UNNEST(SPLIT(grp)) as grp
),
intersection AS (
SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
FROM tempTable a
JOIN tempTable b
USING (grp)
WHERE a.author > b.author
GROUP BY a.author, b.author
),
count_distinct_groups AS (
SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
FROM tempTable
GROUP BY author
),
join_it AS (
SELECT
intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
FROM
intersection
JOIN
count_distinct_groups cg1
ON
intersection.author1 = cg1.author
JOIN
count_distinct_groups cg2
ON
intersection.author2 = cg2.author
)
SELECT
*,
count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
join_it
A full cross join on Big Data (tens of thousands x millions) fails for too much shuffling while the second proposal takes hours to execute. That one takes minutes.
The consequence of this approach though is that pairs having no intersection will not appear, so it will be the responsibility of the process that uses it to handle IFNULL.
Last detail: the union on Daniel and Harry is 10 rather than 11 as group15 is repeated in the initial example.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With