Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL- jaccard similarity

My table looks as follows:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I want my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam   | harry| 0 | 10

THANK YOU

like image 682
AspiringSoftwareDeveloper Avatar asked Oct 18 '22 11:10

AspiringSoftwareDeveloper


2 Answers

Try below (for BigQuery)

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added solution for BigQuery Standard SQL

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

What I like about this one:

  • much simpler / friendlier code
  • no CROSS JOIN and extra GROUP BY needed

When/If try - make sure to uncheck Use Legacy SQL checkbox under Show Options

like image 118
Mikhail Berlyant Avatar answered Oct 21 '22 04:10

Mikhail Berlyant


I propose this option that scales better:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),

tempTable AS (
  SELECT author, grp
  FROM YourTable, UNNEST(SPLIT(grp)) as grp
),

intersection AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
  FROM tempTable a 
  JOIN tempTable b
  USING (grp)
  WHERE a.author > b.author
  GROUP BY a.author, b.author
),

count_distinct_groups AS (
  SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
  FROM tempTable
  GROUP BY author
),

join_it AS (
  SELECT
    intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
  FROM
    intersection
  JOIN
    count_distinct_groups cg1
  ON
    intersection.author1 = cg1.author
  JOIN
    count_distinct_groups cg2
  ON
    intersection.author2 = cg2.author
)

SELECT
  *,
  count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
  intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
  join_it

A full cross join on Big Data (tens of thousands x millions) fails for too much shuffling while the second proposal takes hours to execute. That one takes minutes.

The consequence of this approach though is that pairs having no intersection will not appear, so it will be the responsibility of the process that uses it to handle IFNULL.

Last detail: the union on Daniel and Harry is 10 rather than 11 as group15 is repeated in the initial example.

like image 28
Laurent Candillier Avatar answered Oct 21 '22 05:10

Laurent Candillier