Best clustering algorithm? (simply explained)

Tags:

Imagine the following problem:

You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too

I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.

APPROACH 1

<?php
$zeit = time();
function check($str1, $str2){
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>

APPROACH 2 [only check()]

<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}
function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words
    // document 1 start
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1+square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end
    // document 2 start
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2+square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end
    $skalarprodukt = 0;
    for ($m=0; $m<count($words)-1; $m++) {
        $skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
    }
    if (($rating1*$rating2) == 0) { continue; }
    $kosinusmass = $skalarprodukt/($rating1*$rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    }
    else {
        return TRUE;
    }
}
?>

I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.

I hope you can help me. Thanks in advance!

785

asked May 12 '09 14:05

caw

1 Answers

The most standard way I know of to do this on text data like you have, is to use the 'bag of words' technique.

First, create a 'histogram' of words for each article. Lets say between all your articles, you only have 500 unique words between them. Then this histogram is going to be a vector(Array, List, Whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:

for word in article.text
    article.histogram[indexLookup[word]]++

Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:

def check(articleA, articleB)
    rtn = 0
    for a,b in zip(articleA.histogram, articleB.histogram)
        rtn += a*b
    return rtn > threshold

(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)

This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.

Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.

By the way, this technique is described in more detail here

192

answered Sep 23 '22 10:09

Albinofrenchy

Related questions
                            
                                What should be considered when building a Recommendation Engine?
                            
                                Given an array of integers where some numbers repeat 1 time or 2 times but one number repeats 3 times, how do you find it?
                            
                                Is golden section search better than binary search?
                            
                                How to find which elements are in the bag, using Knapsack Algorithm [and not only the bag's value]?
                            
                                Randomly permute N first elements of a singly linked list
                            
                                Given a target sum and a set of integers, find the closest subset of numbers that add to that target
                            
                                Calculate the number of ways to roll a certain number
                            
                                potential O(n) solution to Longest Increasing Subsequence
                            
                                Complexity of partial_sort vs nth_element
                            
                                How do I use the new C++17 execution policies? [duplicate]
                            
                                How to implement lock-free skip list
                            
                                Algorithm for finding visually similar photos from a database?
                            
                                Smallest multiple having only 0 and 4 as its digits
                            
                                Robust Face Detection in C/C++?
                            
                                2d point clustering
                            
                                Recognizing distortions in a regular grid
                            
                                Advantage of depth first search over breadth first search or vice versa
                            
                                Any difference between Lazy evaluation and Short-circuit evaluation?
                            
                                Utility of parameter 'out' in numpy functions
                            
                                Algorithm for suggesting products

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Best clustering algorithm? (simply explained)

Tags:

text

algorithm

cluster-analysis

text-mining

data-mining

caw

People also ask

1 Answers

Albinofrenchy

Recent Activity

Donate For Us