Clustering a huge number of URLs

I have to find similar URLs like

'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'

and gather them in groups or clusters. My problems:

  • The number of URLs is large (1,580,000)
  • I don't know which clustering algorithm or similarity measure is the best choice

I would appreciate any suggestion on this.

asked Dec 09 '22 03:12 by Soghra Gargari

2 Answers

There are a few problems at play here. First, you'll probably want to "wash" the URLs with a dictionary, for example converting

http://teethwhitening360.com/teeth-whitening-treatments/18/

to

teeth whitening 360 com teeth whitening treatments 18

then you may want to stem the words somehow, e.g. using the Porter stemmer:

teeth whiten 360 com teeth whiten treatment 18
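A minimal stdlib-only sketch of this washing/stemming step (the `toy_stem` function here is a crude stand-in for a real Porter stemmer such as NLTK's, and a proper dictionary-based wash would additionally segment glued tokens like `teethwhitening360` into separate words):

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """'Wash' a URL into lowercase word tokens (host + path)."""
    parsed = urlparse(url)
    text = (parsed.netloc + " " + parsed.path).lower()
    # split on anything that is not a letter or digit
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]

def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

url = "http://teethwhitening360.com/teeth-whitening-treatments/18/"
tokens = [toy_stem(t) for t in url_tokens(url)]
# tokens: ['teethwhitening360', 'com', 'teeth', 'whiten', 'treatment', '18']
```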

Then you can use a simple vector space model to map the URLs into an n-dimensional space and run k-means clustering on them. It's a basic approach, but it should work.
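A minimal sketch of this vector-space + k-means step, assuming scikit-learn is available (the washed/stemmed strings and the cluster count are illustrative, not part of the answer):

```python
# Vector space model + k-means over washed/stemmed URL strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "teeth whiten 360 com teeth whiten treatment 18",
    "teeth whiten 360 com laser teeth whiten 22",
    "unwanted hair removal blogspot com unwanted hair removal product",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse n_urls x n_terms matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_           # one cluster id per URL
```

The two teeth-whitening URLs share most of their terms, so they end up in the same cluster, separate from the hair-removal URL.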

The number of URLs involved shouldn't be a problem; it depends on the language/environment you're using. I would think Matlab would be able to handle it.

answered Jan 06 '23 18:01 by Howard


Tokenizing and stemming are obvious things to do. You can then easily turn the token lists into sparse TF-IDF vectors. Crawling the actual web pages to get additional tokens is probably too much work.

After this, you should be able to run any flexible clustering algorithm on the data set. By flexible I mean that you need to be able to use, for example, cosine distance instead of Euclidean distance (which does not work well on sparse vectors). k-means in GNU R, for example, unfortunately supports only Euclidean distance and dense vectors. Ideally, choose a framework that is very flexible, but also optimizes well. If you want to try k-means, since it is a simple (and thus fast) and well-established algorithm, I believe there is a variant called "convex k-means" that could be applicable for cosine distance and sparse tf-idf vectors.
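One common workaround, sketched here with scikit-learn (an assumption on my part, not something this answer prescribes): L2-normalize the tf-idf vectors, after which ordinary Euclidean k-means behaves like cosine-based clustering, because for unit vectors ||u - v||^2 = 2 - 2*cos(u, v):

```python
# Cosine-like k-means via L2-normalized tf-idf vectors.
# TfidfVectorizer uses norm='l2' by default; it is spelled out here
# because the unit length of the rows is what makes this trick work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "teeth whiten 360 com teeth whiten treatment 18",
    "teeth whiten 360 com laser teeth whiten 22",
    "unwanted hair removal blogspot com unwanted hair removal product",
]

X = TfidfVectorizer(norm="l2").fit_transform(docs)  # unit-length sparse rows
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```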

Classic "hierarchical clustering" (apart from being outdated and not performing very well) is usually a problem due to the O(n^3) complexity of most algorithms and implementations. There are some specialized cases where an O(n^2) algorithm is known (SLINK, CLINK), but often the toolboxes only offer the naive cubic-time implementation (including GNU R, Matlab, SciPy, from what I just googled). And again, they will often have only a limited choice of distance functions available, probably not including cosine.

The methods are, however, often easy enough to implement yourself, optimized for your actual use case.

answered Jan 06 '23 17:01 by Has QUIT--Anony-Mousse