Extract Relevant Tag/Keywords from Text block

I want an implementation where the user provides a block of text like:

"Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable."

What I want to do is automatically select relevant keywords and create tags/keywords, hence for the above piece of text, relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing this in PHP, JavaScript, etc.? A head start would be really helpful.

sgomez asked Jan 28 '11 12:01


2 Answers

A very naive method is to remove common stop words from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise, however, so you may want to consider a service like OpenCalais, which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to existing implementations, but since you asked for one, here is a simple one:

function stopWords($text, $stopwords) {

  // Trim line breaks and spaces from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace digits and all non-word chars with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",",$text);

  // remove whitespace and lowercase words in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

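  // Collect every term that is not a stop word
  $keywords = array();
  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // array_filter() drops the empty strings left by consecutive separators
  return array_filter($keywords);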
}

$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, and the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (it's slow, since in_array() scans the whole stop word list for every term), but it does pull out the more relevant keywords from your text. You would need to fine-tune the stop words as well. Capturing terms like Web 2.0 will be very difficult, because the pattern above splits on digits and punctuation, so again I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.

Also, for a client-side implementation you could do pretty much the same thing with JavaScript, and probably much more cleanly (although it could be slow for the client).
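
For what it's worth, here is a rough sketch of how the same stop word filter might look in JavaScript. It is only an illustration under a few assumptions: the function name extractKeywords and the tiny STOP_WORDS list are made up for this example (they are not the contents of stop_words.txt), and the de-duplication step is an extra that the PHP version does not do. Using a Set for the stop words makes each lookup cheap instead of scanning the whole list per term.

```javascript
// Illustrative stop word list only; a real list would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "and", "as", "be", "of", "on", "such", "the", "using", "will", "with",
]);

// Hypothetical helper: returns the unique, lowercased terms that are not stop words.
function extractKeywords(text, stopWords = STOP_WORDS) {
  const terms = text
    .toLowerCase()
    // Same idea as the PHP pattern: digits and non-word chars act as separators.
    .split(/[0-9\W]+/)
    // Drop empty strings and anything in the stop word set.
    .filter((term) => term.length > 0 && !stopWords.has(term));

  // Extra step (not in the PHP version): remove duplicates, keeping first-seen order.
  return [...new Set(terms)];
}

const sample = "Comfortable with JSON - Cross Browser Javascripting, JQuery etc.";
console.log(extractKeywords(sample));
// logs: comfortable, json, cross, browser, javascripting, jquery, etc (as an array)
```

As with the PHP version, "Web 2.0" would still be broken apart by the separator pattern, so this does nothing for multi-word or versioned terms.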

Darren Newton answered Nov 13 '22 11:11


I did a quick review of these this morning, and to my surprise the one that performed best with my test phrase was written in PHP:

  • http://code.fivefilters.org/term-extraction
  • demo: http://fivefilters.org/term-extraction/

What looked like the most professional one performed abysmally: viewer.opencalais.com

Others that were OK (not sure what language they're written in):

  • www.nactem.ac.uk/software/termine/#form
  • www.alchemyapi.com/api/keyword/

user2412642 answered Nov 13 '22 12:11