
How can I get the most popular phrases from a lot of text?

Tags:

php

I'm setting up a Twitter-style "trending topics" box for my forum. I've got the most popular /words/, but can't even begin to think how I will get popular phrases, like Twitter does.

As it stands, I just concatenate the content of the last 200 posts into a string, split it into words, and sort by which words are used the most. How can I turn this from most popular words into most popular phrases?

katoth asked Oct 13 '10 20:10


2 Answers

One technique you might consider is using ZSETs (sorted sets) in Redis. Even with very large sets of data, you can do something like this:

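// Split the input on whitespace; empty strings from repeated spaces are dropped.
$words = preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);
$word_count = count($words);

$r = new Redis(); // Owlient's PHPRedis PECL extension
$r->connect("127.0.0.1", 6379);

// Join the words of one phrase and bump its score in the sorted set.
function process_phrase($phrase) {
    global $r;
    $phrase = implode(" ", $phrase);
    $r->zIncrBy("trending_phrases", 1, $phrase);
}

// Visit every run of consecutive words: start at $i, take $j words.
// Using <= (rather than <) lets phrases end on the last word.
for ($i = 0; $i < $word_count; $i++) {
    for ($j = 1; $j <= $word_count - $i; $j++) {
        process_phrase(array_slice($words, $i, $j));
    }
}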
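The loop above scores every run of consecutive words, so the number of phrases grows quadratically with the length of the input. As a rough sketch of one way to keep that in check, the function below counts only phrases between a minimum and maximum word length, entirely in memory and without Redis. The name `count_phrases`, its parameters, and the 2 to 4 word default are assumptions made for illustration, not part of the answer above; it also lowercases the text and treats anything other than ASCII letters, digits, and apostrophes as a separator, which is a simplification that would need adjusting for non-English posts.

```php
<?php
/**
 * Count every phrase of $minWords..$maxWords consecutive words in $text.
 * Hypothetical helper: returns an array of phrase => count, highest count first.
 */
function count_phrases(string $text, int $minWords = 2, int $maxWords = 4): array
{
    // Assumption: ASCII-only tokenising. Lowercase, then split on any run of
    // characters that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $counts = [];

    for ($i = 0; $i < $total; $i++) {
        // Take $len words starting at $i, stopping at the end of the input.
        for ($len = $minWords; $len <= $maxWords && $i + $len <= $total; $len++) {
            $phrase = implode(' ', array_slice($words, $i, $len));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    arsort($counts); // sort by count, descending, keeping phrase keys
    return $counts;
}
```

Calling `array_slice(count_phrases($input), 0, 10, true)` would then give the ten most frequent phrases. Note that this sketch lets phrases run across sentence boundaries; splitting the text into sentences first would avoid that.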

To retrieve the top phrases, you'd use this:

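// Assume $r is instantiated like it is above.
// Ranks are zero-based and inclusive, so 0..9 returns ten entries.
// (Current PHPRedis names this method zRevRange.)
$trending_phrases = $r->zRevRange("trending_phrases", 0, 9);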

$trending_phrases will be an array of the top ten trending phrases. To show recent trending phrases (as opposed to a persistent, global set of phrases), repeat the Redis calls above against per-day keys: build each key from a day number (i.e. days since Jan 1, 1970), for example today's and tomorrow's. When retrieving the results, fetch both today's and tomorrow's (or yesterday's) key and use array_merge and array_unique to find the union.
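A minimal sketch of the per-day key idea follows. The helper names `trending_key` and `merge_trending` and the `trending_phrases:<day>` key layout are hypothetical. One deliberate difference from the description above: instead of array_merge and array_unique, which discard the scores, this version sums the scores from each day so the combined list can still be ranked. The Redis calls are shown only as comments because they need a running server; they assume PHPRedis's `zRevRange` with its fourth argument set to `true`, which returns phrase => score pairs.

```php
<?php
/** Build a per-day key such as "trending_phrases:14895" (days since Jan 1, 1970, UTC). */
function trending_key(int $timestamp, string $prefix = 'trending_phrases'): string
{
    return $prefix . ':' . intdiv($timestamp, 86400);
}

/**
 * Combine several phrase => score arrays by summing scores per phrase,
 * then keep the $limit highest-scoring phrases.
 */
function merge_trending(array $buckets, int $limit = 10): array
{
    $totals = [];
    foreach ($buckets as $bucket) {
        foreach ($bucket as $phrase => $score) {
            $totals[$phrase] = ($totals[$phrase] ?? 0) + $score;
        }
    }
    arsort($totals); // highest combined score first
    return array_slice($totals, 0, $limit, true);
}

// Hypothetical wiring with PHPRedis (left commented out: needs a Redis server):
// $r->zIncrBy(trending_key(time()), 1, $phrase);
// $today     = $r->zRevRange(trending_key(time()), 0, 49, true);
// $yesterday = $r->zRevRange(trending_key(time() - 86400), 0, 49, true);
// $top       = merge_trending([$today, $yesterday]);
```

Fetching more than ten entries per day (50 in the commented example, an arbitrary choice) gives phrases that rank moderately on both days a chance to reach the combined top ten.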

Hope this helps!

mattbasta answered Oct 04 '22 02:10


I'm not sure what type of answer you were looking for, but Laconica:

http://status.net/?source=laconica

is an open-source Twitter clone (a much simpler version).

Maybe you could use part of its code to build your own popular phrases?

Good luck!

Trufa answered Oct 04 '22 01:10