Of course Google has been doing this for years! However, rather than start from scratch, spend 10+ years and squander large sums of money :), does anyone know of a simple PHP library that would return a list of important words (and/or some sort of context) from a web page or chunk of text?
On a basic level, I am guessing most spiders pull in the words, remove words without real meaning, then count the rest. The most frequently occurring words are most likely what I'm interested in.
Any sort of pointers would be really appreciated!
I can give you pointers, but the topic you want to look up and research is Latent Semantic Indexing (LSI).
Rather than explain it, here is a quick snippet from a webpage.
Latent semantic indexing is essentially a way of extracting the meaning from a document without matching a specific phrase. A simple example would be that a document featuring the words ‘Windows’, ‘Bing’, ‘Excel’ and ‘Outlook’ would be about Microsoft. You wouldn’t need ‘Microsoft’ to appear again and again to know that.
This example also highlights the importance of taking into account related words because if ‘windows’ appeared on a page that also featured ‘glazing’, it would most likely be an entirely different meaning.
You can of course go down the easy route of dropping all stop words from the text corpus, but LSI is definitely more accurate. See the sketch below for the easy route.
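For the easy route, here is a minimal PHP sketch of the approach described in the question: tokenize, drop stop words, and rank the remaining words by frequency. The stop-word list here is a tiny illustrative sample; a real one would be much longer.

```php
<?php
// Naive keyword extraction: tokenize, remove stop words, count the rest.
function extractKeywords(string $text, int $limit = 10): array
{
    // Tiny illustrative stop-word list; use a fuller one in practice.
    $stopWords = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in',
                  'is', 'it', 'that', 'this', 'for', 'on', 'with'];

    // Lowercase and split on anything that is not a letter.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Keep words longer than two characters that are not stop words.
    $words = array_filter($words, function ($w) use ($stopWords) {
        return strlen($w) > 2 && !in_array($w, $stopWords, true);
    });

    // Count occurrences and sort by frequency, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}

print_r(extractKeywords('The quick brown fox jumps over the lazy dog. The fox is quick.'));
// Prints the non-stop words ranked by frequency ("quick" and "fox" first).
```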
I will update this post with more info in about 30 minutes. (Still intending to update this post; I got too busy with work.)
Okay, so the basic idea behind LSA is to offer a new/different approach to retrieving a document based on a particular search term, though you can just as easily use it to determine the meaning of a document. One of the problems with the search engines of yesteryear was that they were based purely on keyword analysis. If you take Yahoo/AltaVista from roughly 1999 through to 2002/03 (don't quote me on this), they were extremely dependent on ONLY using keywords as a factor in retrieving a document from their index. Keywords, however, don't translate to anything other than the keyword which they represent. The keyword "hot" means lots of things depending on the context in which it is placed. If you were to take the term "hot" and identify that it was placed around other terms such as "chillies", "spices" or "herbs", then conceptually it means something totally different than the term "hot" surrounded by terms such as "heat", "warmth", or "sexy" and "girl".
LSA attempts to overcome these deficiencies by working on a matrix of statistical probabilities (which you build yourself).
Anyway, on to some tools that help you build this matrix of documents/terms (and cluster them by proximity within the corpus). This works to the benefit of search engines by transposing keywords into concepts, so that if you search for a particular keyword, that keyword might not even appear in the documents which are retrieved, but the concept which the keyword represents does.
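Before reaching for those tools, the first step is easy to illustrate in plain PHP. This is a minimal sketch of building the term/document matrix; real LSA then applies a singular value decomposition (SVD) to it, which you would hand off to a maths library or to Solr rather than write yourself:

```php
<?php
// Build a term/document matrix: rows are terms, columns are documents,
// cells are raw occurrence counts. LSA's SVD step is deliberately omitted.
function buildTermDocumentMatrix(array $documents): array
{
    $matrix = [];
    foreach ($documents as $docId => $text) {
        $terms = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_count_values($terms) as $term => $count) {
            $matrix[$term][$docId] = $count;
        }
    }
    return $matrix;
}

$docs = [
    'doc1' => 'hot chillies and spices',
    'doc2' => 'hot weather and heat',
];
print_r(buildTermDocumentMatrix($docs));
// 'hot' gets a count in both columns; 'chillies' and 'heat' in only one.
// That co-occurrence pattern is exactly the information LSA exploits.
```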
I've always used Lucene / Solr for search, and a quick Google search for "Solr LSA LSI" returned a few links.
http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
This guy seems to have created a plugin for it.
http://github.com/algoriffic/lsa4solr
I might check it out over the next few weeks and see how it gets on.
Go have a look at Calais and Zemanta. Very cool stuff!
Personally, I'd be inclined to use something like a Brill parser to identify the part of speech of each word, discarding pronouns, verbs, etc., and using that to extract a list of nouns (possibly with any qualifying adjectives) to build your list of keywords. You can find a PHP implementation of a Brill parser on Ian Barber's PHP/IR site; a sketch of the filtering step follows below.
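As a rough sketch of the filtering step only: the $tagger object and its tag() method below are assumptions for illustration (as are the Penn Treebank style tags NN, NNS, JJ, ...); check the actual class on Ian Barber's site for the real API.

```php
<?php
// Keep nouns (optionally prefixed by a qualifying adjective), discard the rest.
// ASSUMPTION: $tagger->tag($text) returns an array of
// ['token' => word, 'tag' => posTag] pairs; adapt to the real tagger's output.
function extractNounKeywords(object $tagger, string $text): array
{
    $keywords = [];
    $pendingAdjective = '';

    foreach ($tagger->tag($text) as $token) {
        if (strpos($token['tag'], 'JJ') === 0) {
            // Remember a qualifying adjective so it can prefix the next noun.
            $pendingAdjective = $token['token'];
        } elseif (strpos($token['tag'], 'NN') === 0) {
            // Nouns (NN, NNS, NNP, ...) become keywords, adjective attached.
            $keywords[] = trim($pendingAdjective . ' ' . $token['token']);
            $pendingAdjective = '';
        } else {
            // Pronouns, verbs, etc. are discarded, as suggested above.
            $pendingAdjective = '';
        }
    }
    return $keywords;
}
```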