SQL- jaccard similarity

Question

My table looks as follows:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I want my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam   | harry| 0 | 10

THANK YOU

Mikhail Berlyant · Accepted Answer

Try below (for BigQuery)

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added solution for BigQuery Standard SQL

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

What I like about this one:

much simpler / friendlier code
no CROSS JOIN and extra GROUP BY needed

When/If try - make sure to uncheck Use Legacy SQL checkbox under Show Options

Laurent Candillier · Answer

I propose this option that scales better:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),

tempTable AS (
  SELECT author, grp
  FROM YourTable, UNNEST(SPLIT(grp)) as grp
),

intersection AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
  FROM tempTable a 
  JOIN tempTable b
  USING (grp)
  WHERE a.author > b.author
  GROUP BY a.author, b.author
),

count_distinct_groups AS (
  SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
  FROM tempTable
  GROUP BY author
),

join_it AS (
  SELECT
    intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
  FROM
    intersection
  JOIN
    count_distinct_groups cg1
  ON
    intersection.author1 = cg1.author
  JOIN
    count_distinct_groups cg2
  ON
    intersection.author2 = cg2.author
)

SELECT
  *,
  count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
  intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
  join_it

A full cross join on Big Data (tens of thousands x millions) fails for too much shuffling while the second proposal takes hours to execute. That one takes minutes.

The consequence of this approach though is that pairs having no intersection will not appear, so it will be the responsibility of the process that uses it to handle IFNULL.

Last detail: the union on Daniel and Harry is 10 rather than 11 as group15 is repeated in the initial example.

SQL- jaccard similarity

Tags:

sql

google-bigquery

AspiringSoftwareDeveloper

2 Answers

Mikhail Berlyant

Laurent Candillier

Recent Activity

Donate For Us

SQL- jaccard similarity

Tags:

sql

google-bigquery

AspiringSoftwareDeveloper

2 Answers

Mikhail Berlyant

Laurent Candillier

Related questions

Recent Activity

Donate For Us