Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

INNER JOIN on same table counting word occurrences

http://sqlfiddle.com/#!2/e6382

id   id_news  word
 1    6       superman
 2    6       movie
 3    6       review
 4    6       excellent
 5    7       review
 6    7       guardians of the galaxy
 7    7       great
 8    8       review
 9    8       superman
10    8       movie
11    8       great

I have a small problem, I'm trying to relate different news through words with a threshold setting, in the example provided id_news 6 should be related to 8 but not to 7 since 7 only has 2 words in common and I only want to detect those who have at least 3 words in common.

like image 680
ln -s Avatar asked Aug 13 '14 13:08

ln -s


2 Answers

This will get you close to what you need:

  SELECT wa1.id_news id, wa2.id_news related
    FROM word_analysis wa1
    JOIN word_analysis wa2
      ON wa2.id_news != wa1.id_news
     AND wa2.word = wa1.word
GROUP BY wa1.id_news, wa2.id_news
  HAVING COUNT(*)>2
ORDER BY wa1.id_news, wa2.id_news

If you don't want the reverse relationships:

  SELECT wa1.id_news id, wa2.id_news related
    FROM word_analysis wa1
    JOIN word_analysis wa2
      ON wa2.id_news > wa1.id_news
     AND wa2.word = wa1.word
GROUP BY wa1.id_news, wa2.id_news
  HAVING COUNT(*)>2
ORDER BY wa1.id_news, wa2.id_news

If you want to investigate just one wa1.id_news (6):

  SELECT wa2.id_news related
    FROM word_analysis wa1
    JOIN word_analysis wa2
      ON wa2.id_news != wa1.id_news
     AND wa2.word = wa1.word
   WHERE wa1.id_news = 6
GROUP BY wa1.id_news, wa2.id_news
  HAVING COUNT(*)>2
ORDER BY wa2.id_news

If you want to investigate just one relationship (6->8), where a result means related and no result means unrelated:

  SELECT 1
    FROM word_analysis wa1
    JOIN word_analysis wa2
      ON wa2.id_news = 8
     AND wa2.word = wa1.word
   WHERE wa1.id_news = 6
GROUP BY wa1.id_news, wa2.id_news
  HAVING COUNT(*)>2
like image 72
Arth Avatar answered Oct 17 '22 16:10

Arth


Try this self-join:

SELECT
  wa1.id_news id_news_1,
  wa2.id_news id_news_2,
  count(wa2.word) cnt_words
FROM word_analysis wa1
INNER JOIN word_analysis wa2
ON wa1.id_news <> wa2.id_news AND wa1.word = wa2.word
GROUP BY wa1.id_news, wa2.id_news
HAVING count(wa2.word) >= 3
ORDER BY wa1.id_news, wa2.id_news;

SQL Fiddle demo

like image 3
Joseph B Avatar answered Oct 17 '22 17:10

Joseph B