
Python: Clustering Search Engine Keywords

Hi, I have a CSV of up to 20,000 rows (I have had 100,000+ for other websites). Each row contains a referring keyword (i.e. a keyword someone typed into a search engine to find the website in question) and a number of visits.

What I'm looking to do is cluster these keywords into clusters of "similar meaning", and create a hierarchy of the clusters (structured in order of summed total number of searches per cluster).

An example cluster, "womens clothing", would ideally contain keywords along these lines (keyword, visits):

womens clothing, 1000
ladies wear, 300
womens clothes, 50
ladies clothing, 6
womens wear, 2
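Once each keyword has been assigned to a cluster by some means, the ordering by summed visits is the easy part. Below is a minimal sketch of that step only, using the example figures above. The function name `rank_clusters`, the `cluster_of` mapping and the extra "mens shoes" row are hypothetical and exist purely for illustration; how the mapping gets built is the open question.

```python
from collections import defaultdict

def rank_clusters(rows, cluster_of):
    """Order clusters by summed visits.

    rows       -- iterable of (keyword, visits) pairs
    cluster_of -- dict mapping keyword -> cluster label (hypothetical input)
    """
    totals = defaultdict(int)
    members = defaultdict(list)
    for keyword, visits in rows:
        # Keywords with no assignment become their own single-item cluster.
        label = cluster_of.get(keyword, keyword)
        totals[label] += visits
        members[label].append((keyword, visits))
    ranked_labels = sorted(totals, key=lambda label: totals[label], reverse=True)
    return [
        (label, totals[label],
         sorted(members[label], key=lambda kv: kv[1], reverse=True))
        for label in ranked_labels
    ]

# Example figures from the question, plus one made-up unrelated keyword.
rows = [
    ("womens clothing", 1000),
    ("ladies wear", 300),
    ("womens clothes", 50),
    ("ladies clothing", 6),
    ("womens wear", 2),
    ("mens shoes", 40),  # hypothetical row, not from the original data
]
cluster_of = {keyword: "womens clothing" for keyword, _ in rows[:5]}
ranked = rank_clusters(rows, cluster_of)

for label, total, keywords in ranked:
    print(label, total, keywords)
```

Here the cluster is labelled by hand; in practice the label could simply be the most-visited keyword in each cluster.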

I could look to use something like the Python Natural Language Toolkit (http://www.nltk.org/) and WordNet, but I'm guessing that for some websites the referring keywords will be words/phrases that WordNet knows nothing about. For example, if the website is a celebrity website, WordNet is unlikely to know anything about "Lady Gaga", and the situation is worse if the website is a news website.

So I'm also guessing that the solution has to be one that uses just the source data itself.
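As a starting point that uses only the source data, here is a rough sketch of greedy clustering by word overlap (Jaccard similarity), standard library only. It is one possible approach, not an established solution: the two-column, header-less CSV layout, the `threshold` value of 0.3 and all function names are assumptions made for illustration.

```python
import csv
import io

def load_rows(csv_file):
    """Read (keyword, visits) pairs; assumes two columns and no header row."""
    return [(row[0].strip(), int(row[1])) for row in csv.reader(csv_file) if row]

def tokens(keyword):
    return frozenset(keyword.lower().split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_keywords(rows, threshold=0.3):
    """Greedy clustering: each keyword joins the most similar seed, or starts a new cluster."""
    clusters = []
    # Highest-traffic keywords first, so each cluster is seeded and
    # labelled by its most-visited keyword.
    for keyword, visits in sorted(rows, key=lambda r: r[1], reverse=True):
        toks = tokens(keyword)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(toks, cluster["tokens"])
            if sim > best_sim:  # ties keep the earlier (higher-traffic) cluster
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append((keyword, visits))
            best["visits"] += visits
        else:
            clusters.append({"label": keyword, "tokens": toks,
                             "members": [(keyword, visits)], "visits": visits})
    # Order clusters by summed visits, largest first.
    return sorted(clusters, key=lambda c: c["visits"], reverse=True)

# In-memory stand-in for the real CSV file.
sample = io.StringIO(
    "womens clothing,1000\nladies wear,300\nwomens clothes,50\n"
    "ladies clothing,6\nwomens wear,2\n"
)
clusters = cluster_keywords(load_rows(sample))
for c in clusters:
    print(c["label"], c["visits"], c["members"])
```

On the example data this puts "ladies wear" in a cluster of its own, because it shares no words with "womens clothing". Word overlap only groups keywords that have words in common; linking synonyms such as "ladies" and "womens" would need extra signal, for instance which keywords lead to the same landing pages, if that data is available.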

My query is very similar to the one raised at How to cluster search engine keywords?, only I'm looking for somewhere to start but using Python instead of Java.

I did also wonder whether Google Predict and/or Google Refine might be of any use.

Anyway, any thoughts/suggestions most welcome,

Thanks, C

user679134 asked Mar 28 '11 10:03


1 Answer

I like Whoosh a lot. It is a pure-Python search engine library that provides, among other things, that kind of functionality. Check it out.

http://packages.python.org/Whoosh/index.html

The feature you are looking for is called "faceted search results":

http://packages.python.org/Whoosh/facets.html

Hernan

Hernan answered Oct 04 '22 14:10