Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best clustering algorithm? (simply explained)

Imagine the following problem:

  • You have a database containing about 20,000 texts in a table called "articles"
  • You want to connect the related ones using a clustering algorithm in order to display related articles together
  • The algorithm should do flat clustering (not hierarchical)
  • The related articles should be inserted into the table "related"
  • The clustering algorithm should decide whether two or more articles are related or not based on the texts
  • I want to code in PHP but examples with pseudo code or other programming languages are ok, too

I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.

APPROACH 1

<?php
$zeit = time();
function check($str1, $str2){
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>

APPROACH 2 [only check()]

<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}
function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words
    // document 1 start
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1+square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end
    // document 2 start
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2+square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end
    $skalarprodukt = 0;
    for ($m=0; $m<count($words)-1; $m++) {
        $skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
    }
    if (($rating1*$rating2) == 0) { continue; }
    $kosinusmass = $skalarprodukt/($rating1*$rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    }
    else {
        return TRUE;
    }
}
?>

I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.

I hope you can help me. Thanks in advance!

like image 785
caw Avatar asked May 12 '09 14:05

caw


People also ask

Which clustering algorithm is better than k-means?

Fuzzy c-means clustering has can be considered a better algorithm compared to the k-Means algorithm. Unlike the k-Means algorithm where the data points exclusively belong to one cluster, in the case of the fuzzy c-means algorithm, the data point can belong to more than one cluster with a likelihood.

What is the most commonly used clustering technique?

K-means clustering is the most commonly used clustering algorithm. It's a centroid-based algorithm and the simplest unsupervised learning algorithm. This algorithm tries to minimize the variance of data points within a cluster.

Why is K means clustering best?

Guarantees convergence. Can warm-start the positions of centroids. Easily adapts to new examples. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.


1 Answers

The most standard way I know of to do this on text data like you have, is to use the 'bag of words' technique.

First, create a 'histogram' of words for each article. Lets say between all your articles, you only have 500 unique words between them. Then this histogram is going to be a vector(Array, List, Whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:

for word in article.text
    article.histogram[indexLookup[word]]++

Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:

def check(articleA, articleB)
    rtn = 0
    for a,b in zip(articleA.histogram, articleB.histogram)
        rtn += a*b
    return rtn > threshold

(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)

This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.

Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.

By the way, this technique is described in more detail here

like image 192
Albinofrenchy Avatar answered Sep 23 '22 10:09

Albinofrenchy