Creating an efficient way to build a related articles feature with php and mysql

Question

First, let me begin by saying that I have done a lot of research on this topic and have already invested a lot of time into a workable solution. With that said, I'm running into some issues that I can't seem to overcome and am therefore seeking some guidance in the right direction.

Little backstory: I write/maintain the php/mysql for a website. We are basically a gaming site that posts articles, reviews, videos, etc.

Question: I have a MySQL database that stores all of the website's content. I pull words from 4 fields of each article, and I would like to match them against all other articles in the database to determine the top 3 related articles so that they can be displayed. What is the most efficient way to accomplish this?

Here's what I've done so far:

In the CMS I have designed, I have essentially built a "bag-of-words" type system. The program goes through all the articles (there are about 4,000) and stores every word in a separate table. In this table the word, word count in article, tf*idf (more on this later), and article id (x-ref to the content table) are stored. So, a word can be in this table more than once, but not more than once for one article. After processing this (which takes about 4 minutes) there are close to 700,000 entries in this new table.
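For reference, the per-article counting step described above might look roughly like the following. This is only a sketch: the function name `count_words` and the tokenizing rule (lowercase, split on anything that is not a letter, digit or apostrophe) are assumptions, not the CMS's actual code.

```php
<?php
// Hypothetical helper: count how often each word occurs in one article's text.
function count_words(string $text): array
{
    // Lowercase, then split on runs of characters that are not a-z, 0-9 or an apostrophe.
    $tokens = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // array_count_values maps each distinct word to its number of occurrences.
    return array_count_values($tokens);
}

$counts = count_words("Halo review: the new Halo is the best Halo yet");
// $counts['halo'] is 3 and $counts['the'] is 2.
// array_sum($counts) is the article's total word count (10 here).
```

Each key/value pair of the returned array would correspond to one row of the word table for that article.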

Then, I have another program that goes through this new word table and computes the tf*idf for each entry. Going through the entire list of 700,000 entries takes this program about 15 minutes.
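The post does not say which tf*idf variant is used, so the sketch below assumes the common textbook form: term frequency as count divided by the article's total words, times the natural log of (number of articles / number of articles containing the word). The function name and parameter names are hypothetical.

```php
<?php
// Hypothetical helper: textbook tf*idf for one word in one article.
// $count       occurrences of the word in the article (the `count` column)
// $total_words total words in the article (the `article_total_word_count` column)
// $total_docs  number of articles in the corpus
// $docs_with   number of articles that contain the word (must be at least 1)
function tfidf(int $count, int $total_words, int $total_docs, int $docs_with): float
{
    $tf  = $count / $total_words;          // how prominent the word is in this article
    $idf = log($total_docs / $docs_with);  // how rare the word is across all articles
    return $tf * $idf;
}

// A word used 3 times in a 300-word article, appearing in 40 of 4,000 articles,
// scores about 0.046.
echo tfidf(3, 300, 4000, 40), "\n";
```

A word that appears in every article gets an idf of log(1) = 0, so it never ranks among an article's top words. The number of articles containing each word can be fetched for all words at once with `SELECT word, COUNT(*) FROM bagofwords GROUP BY word`, which may be quicker than looking it up row by row, although that is a guess about where the 15 minutes are spent.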

Now, this is the part that I'm stuck at. I'm working on the frontend part of this to actually make the system usable. The frontend queries the table for the article currently being viewed (article_id) and pulls its top 20 words sorted by tf*idf. Then, for each of these words, I query for the other articles containing that word and keep an array that maps each such article to the number of words it has in common with the current one. Finally, I sort the array and pull the top 3 articles with the highest number of matches.

This last part works fine and dandy and I actually get pretty good comparisons using a mix between tf*idf and bag-of-words. The problem is that for the frontend part to happen, it takes anywhere from 30-45 seconds. Obviously this is not feasible... it has to be done in a fraction of a second and this is where I'm running into my issue.

I know this was really long, and I apologize for that. I'm basically looking for some help cleaning this idea up, some place I went wrong, different approach. I'm open to all suggestions and would be happy to provide any more information if it would make any of this clearer. Thanks for your time!

Per request, table schema and front end code...

--
-- Table structure for table `bagofwords`
--
CREATE TABLE IF NOT EXISTS `bagofwords` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `article_id` int(11) NOT NULL,
  `article_total_word_count` int(11) NOT NULL,
  `word` text NOT NULL,
  `count` int(11) NOT NULL,
  `timestamp` int(11) NOT NULL,
  `tfidf` float NOT NULL,
  KEY `id` (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=660930 ;


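public function related_articles($article_id, $count = 3) {
    // Cast to int so the id can be interpolated into the SQL safely.
    $article_id = (int) $article_id;

    // Top 20 words of the current article, ranked by tf*idf.
    $query = "SELECT `word` FROM `bagofwords` WHERE `article_id` = $article_id ORDER BY `tfidf` DESC LIMIT 20";
    $result = $this->db->query($query);
    $num_rows = $this->db->num_rows($result);

    $articles_list = array();
    for ($i = 0; $i < $num_rows; $i++) {
        $word = $this->db->fetch_field($result, 'word', $i);

        // NOTE: $word should be escaped before interpolation; a word containing
        // an apostrophe would break this query as written.
        // No ORDER BY here: only the number of matches per article is used.
        $query_word = "SELECT `article_id` FROM `bagofwords` WHERE `word` = '$word' AND `article_id` != $article_id";
        $result_word = $this->db->query($query_word);
        $result_num_rows = $this->db->num_rows($result_word);
        for ($x = 0; $x < $result_num_rows; $x++) {
            $article_id_word = $this->db->fetch_field($result_word, 'article_id', $x);
            // Count how many of the top words each other article shares.
            if (isset($articles_list[$article_id_word])) $articles_list[$article_id_word]++;
            else $articles_list[$article_id_word] = 1;
        }
    }

    // Sort by match count, highest first, keeping article ids as keys.
    // (The original array_flip() call discarded its result, and asort() sorted ascending.)
    arsort($articles_list);

    // Still returns the whole list for debugging;
    // array_slice($articles_list, 0, $count, true) would keep only the top $count.
    return $articles_list;
}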

Ok, this is pretty much the frontend code. As of right now it returns the entire array, which I var_dump on the frontend just to see what kind of data I'm getting. But there has got to be a better way to write all of this in a single MySQL statement using nested queries or temp tables. I can't figure it out!

Neville Kuyt · Accepted Answer

The obvious thing is to run this query as a self-join. I'd need to test on production volumes to optimize it, but something like:

select article.word, count(*) as article_count
from   bagofwords article,
       bagofwords relations
where  article.article_id = '$article_id'
and    article.word       = relations.word
group by article.word, article.tfidf
order by article.tfidf desc, article_count
limit 20

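You also want an index on the column "word". Because that column is declared as text, MySQL requires a prefix length in the index definition (the 32 here is an arbitrary choice):

create index word on bagofwords(word(32))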
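The self-join idea can be taken one step further so that a single query returns related article ids ranked by how many of the current article's top 20 words they share. The sketch below is untested against the real data and rests on several assumptions: it uses PDO instead of the site's own `$this->db` wrapper, the aliases `top_words` and `other` are made up, and the in-memory SQLite table (which needs the pdo_sqlite extension) exists only to make the example runnable. On the real site the PDO object would point at the MySQL database.

```php
<?php
// Hypothetical function: related articles in a single query.
function related_articles(PDO $pdo, int $article_id, int $count = 3): array
{
    // $count is typed int, so interpolating it into LIMIT is safe.
    $sql = "
        SELECT   other.article_id, COUNT(*) AS shared_words
        FROM     (SELECT word
                  FROM   bagofwords
                  WHERE  article_id = :id
                  ORDER  BY tfidf DESC
                  LIMIT  20) AS top_words
        JOIN     bagofwords AS other ON other.word = top_words.word
        WHERE    other.article_id <> :other_id
        GROUP BY other.article_id
        ORDER BY shared_words DESC, other.article_id
        LIMIT    $count";

    $stmt = $pdo->prepare($sql);
    $stmt->bindValue(':id', $article_id, PDO::PARAM_INT);
    $stmt->bindValue(':other_id', $article_id, PDO::PARAM_INT);
    $stmt->execute();

    // Returns [article_id => shared_words], best match first.
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_KEY_PAIR));
}

// Tiny demo data set, for illustration only.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE bagofwords (article_id INT, word TEXT, tfidf REAL)");
$insert = $pdo->prepare("INSERT INTO bagofwords VALUES (?, ?, ?)");
$rows = [
    [1, 'halo', 0.9], [1, 'xbox', 0.5],
    [2, 'halo', 0.7], [2, 'xbox', 0.4],
    [3, 'halo', 0.2], [3, 'zelda', 0.8],
];
foreach ($rows as $row) {
    $insert->execute($row);
}

$related = related_articles($pdo, 1);
// Article 2 shares two of article 1's words and article 3 shares one,
// so $related is [2 => 2, 3 => 1].
```

The subquery keeps the "top 20 words by tf*idf" step from the question, and the outer GROUP BY replaces the PHP counting loop. Whether this reaches a fraction of a second on 700,000 rows depends on the index on `word` and probably on one on `article_id` as well, so it would need to be measured on production volumes.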

Tags:

php

mysql

Lyynk424
