
PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.

After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.

  • Is there such a PHP library out there that I might have missed?
  • If not, is there any FOSS that handles clustering and has a decent API?
vzwick asked Nov 02 '11 11:11



2 Answers

Like this:

Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.

The stopword list needs to contain all common English terms. It should also include punctuation, in which case you will first need to preg_replace each punctuation mark into a separate word, e.g. "Something, like this." -> "Something , like this ." Alternatively, you can just remove all punctuation.
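The first option, keeping punctuation as separate tokens, might look like the sketch below. The function name separatePunctuation is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: pad every ASCII punctuation character with spaces so
// that it becomes a token of its own, then split the text on whitespace.
function separatePunctuation(string $text): array
{
    $spaced = preg_replace('/([[:punct:]])/', ' $1 ', $text);
    return preg_split('/\s+/', $spaced, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(separatePunctuation('Something, like this.'));
// tokens: Something / , / like / this / .
```

With this approach the punctuation marks themselves must appear in the stopword list, so that they end a phrase just like a common word does.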

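// assumes $content is a single string of text
$content=strtolower($content); // lowercase first, otherwise the pattern below strips capitals
$content=preg_replace('/[^a-z\s]/', '', $content); // remove punctuation (and digits)
$words=preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split into words

$stopwords='the|and|is|your|me|for|where|etc...';
$stopwords=explode('|',$stopwords);
$stopwords=array_flip($stopwords); // flip so that isset() works as a fast lookup

$result=array(); $temp=array();
foreach ($words as $s)
 {
 if (isset($stopwords[$s]) OR strlen($s)<3)
  {
  // a stopword or very short word ends the current phrase
  if (sizeof($temp)>0)
   {
   $result[]=implode(' ',$temp);
   $temp=array();
   }
  } else $temp[]=$s;
 }
if (sizeof($temp)>0) $result[]=implode(' ',$temp); // flush the last phrase

$phrases=array_count_values($result); // phrase => number of occurrences
arsort($phrases); // most frequent first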

Now you have an associative array of the phrases that occur in your input data, sorted by descending frequency.

How you do the matching is up to you, and it depends largely on the length of the strings in the input data.

I would check whether any of the top 3 array keys for one item match any of the top 3 keys for any other item in the data. Items that share a key then form your groups.
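One possible way to do that matching is sketched below. It wraps the same stopword-splitting idea as the code above in a function and then groups items by shared top phrases. The function names topPhrases and groupByTopPhrases, the short stopword list, and the sample texts are all invented for illustration.

```php
<?php
// Hypothetical helper: return the $n most frequent phrases of one text.
function topPhrases(string $text, array $stopwords, int $n = 3): array
{
    $text  = preg_replace('/[^a-z\s]/', '', strtolower($text));
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $stop  = array_flip($stopwords);

    $result = []; $temp = [];
    foreach ($words as $w) {
        if (isset($stop[$w]) || strlen($w) < 3) {
            // A stopword or very short word ends the current phrase.
            if ($temp) { $result[] = implode(' ', $temp); $temp = []; }
        } else {
            $temp[] = $w;
        }
    }
    if ($temp) { $result[] = implode(' ', $temp); }

    $phrases = array_count_values($result);
    arsort($phrases);
    return array_slice(array_keys($phrases), 0, $n);
}

// Hypothetical helper: map each top phrase to the indexes of the items
// that have it, keeping only phrases shared by at least two items.
function groupByTopPhrases(array $texts, array $stopwords, int $n = 3): array
{
    $groups = [];
    foreach ($texts as $i => $text) {
        foreach (topPhrases($text, $stopwords, $n) as $phrase) {
            $groups[$phrase][] = $i;
        }
    }
    return array_filter($groups, fn($ids) => count($ids) > 1);
}

// Sample data, made up for this example.
$stopwords = ['the', 'and', 'is', 'for', 'with'];
$texts = [
    'Cheap flights to Paris',
    'Hotels in Paris for the weekend',
    'Learn PHP arrays with examples',
];
print_r(groupByTopPhrases($texts, $stopwords));
// one group: 'paris' => items 0 and 1
```

The shared phrase doubles as the group's name, which fits the "meaningful named groups" requirement in the question. Short titles produce many one-off phrases, so real data would probably need a larger stopword list.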

Let me know if you have any trouble with this.

Alasdair answered Oct 16 '22 18:10



"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.

For starters you could look into K-Means clustering.
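To give an idea of what K-Means involves, here is a minimal sketch in plain PHP. It is not taken from any library or from the site linked below. The function names are invented, texts are turned into simple word-count vectors, and the first $k items seed the clusters (real implementations usually pick random or k-means++ seeds).

```php
<?php
// Hypothetical helper: turn texts into word-count vectors over a shared vocabulary.
function termVectors(array $texts): array
{
    $docs = []; $vocab = [];
    foreach ($texts as $text) {
        $words  = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $docs[] = array_count_values($words);
        $vocab += array_flip($words); // union keeps terms already seen
    }
    $vectors = [];
    foreach ($docs as $counts) {
        $vector = [];
        foreach (array_keys($vocab) as $term) {
            $vector[] = $counts[$term] ?? 0;
        }
        $vectors[] = $vector;
    }
    return $vectors;
}

// Squared Euclidean distance between two equal-length numeric vectors.
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return $sum;
}

// Minimal k-means: returns, for each point index, the index of its cluster.
function kMeans(array $points, int $k, int $maxIterations = 100): array
{
    $centroids  = array_slice($points, 0, $k); // simplification: first $k points as seeds
    $assignment = [];
    for ($iter = 0; $iter < $maxIterations; $iter++) {
        // Assignment step: attach each point to its nearest centroid.
        $new = [];
        foreach ($points as $p => $point) {
            $best = 0; $bestDist = INF;
            foreach ($centroids as $c => $centroid) {
                $d = squaredDistance($point, $centroid);
                if ($d < $bestDist) { $bestDist = $d; $best = $c; }
            }
            $new[$p] = $best;
        }
        if ($new === $assignment) { break; } // nothing changed: converged
        $assignment = $new;

        // Update step: move each non-empty cluster's centroid to the mean of its points.
        $sums = []; $counts = [];
        foreach ($assignment as $p => $c) {
            foreach ($points[$p] as $dim => $v) {
                $sums[$c][$dim] = ($sums[$c][$dim] ?? 0) + $v;
            }
            $counts[$c] = ($counts[$c] ?? 0) + 1;
        }
        foreach ($sums as $c => $sum) {
            foreach ($sum as $dim => $total) {
                $centroids[$c][$dim] = $total / $counts[$c];
            }
        }
    }
    return $assignment;
}

// Sample data, made up for this example.
$texts = ['paris hotel', 'php array', 'paris flight hotel', 'php array function'];
print_r(kMeans(termVectors($texts), 2));
// cluster per item: 0, 1, 0, 1
```

Note that K-Means needs the number of clusters up front and does not name them, so labels would have to come from something else, such as each cluster's most frequent terms.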

Have a look at this page and website:

PHP/ir: Information Retrieval and other interesting topics

EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.

EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!

Dmoz/Monster algorithme to calculate count of each category and sub category?

zaf answered Oct 16 '22 18:10
