Clustering a huge number of URLs

I have to find similar URLs like

'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'

and gather them in groups or clusters. My problems:

  • The number of URLs is large (1,580,000)
  • I don't know which clustering algorithm or similarity measure is the best choice

I would appreciate any suggestion on this.

asked Dec 09 '22 03:12 by Soghra Gargari

2 Answers

There are a few problems at play here. First, you'll probably want to "wash" the URLs with a dictionary, for example converting

http://teethwhitening360.com/teeth-whitening-treatments/18/

to

teeth whitening 360 com teeth whitening treatments 18

then you may want to stem the words somehow, e.g. using the Porter stemmer:

teeth whiten 360 com teeth whiten treatment 18
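A minimal stdlib-only sketch of this washing/stemming step (the `toy_stem` function here is a crude stand-in for a real Porter stemmer such as NLTK's, and a proper dictionary-based wash would additionally segment glued tokens like `teethwhitening360` into separate words):

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """'Wash' a URL into lowercase word tokens (host + path)."""
    parsed = urlparse(url)
    text = (parsed.netloc + " " + parsed.path).lower()
    # split on anything that is not a letter or digit
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]

def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

url = "http://teethwhitening360.com/teeth-whitening-treatments/18/"
tokens = [toy_stem(t) for t in url_tokens(url)]
# tokens: ['teethwhitening360', 'com', 'teeth', 'whiten', 'treatment', '18']
```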

Then you can use a simple vector space model to map the URLs into an n-dimensional space and run k-means clustering on them. It's a basic approach, but it should work.
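A minimal sketch of this vector-space + k-means step, assuming scikit-learn is available (the washed/stemmed strings and the cluster count are illustrative, not part of the answer):

```python
# Vector space model + k-means over washed/stemmed URL strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "teeth whiten 360 com teeth whiten treatment 18",
    "teeth whiten 360 com laser teeth whiten 22",
    "unwanted hair removal blogspot com unwanted hair removal product",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse n_urls x n_terms matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_           # one cluster id per URL
```

The two teeth-whitening URLs share most of their terms, so they end up in the same cluster, separate from the hair-removal URL.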

The number of URLs involved shouldn't be a problem; it depends on the language/environment you're using. I would think Matlab would be able to handle it.

answered Jan 06 '23 18:01 by Howard


Tokenizing and stemming are obvious things to do. You can then easily turn the token lists into sparse TF-IDF vectors. Crawling the actual web pages to get additional tokens is probably too much work.

After this, you should be able to run any flexible clustering algorithm on the data set. By flexible I mean that you need to be able to use, for example, cosine distance instead of Euclidean distance (which does not work well on sparse vectors). k-means in GNU R, for example, unfortunately supports only Euclidean distance and dense vectors. Ideally, choose a framework that is very flexible, but also optimizes well. If you want to try k-means, since it is a simple (and thus fast) and well-established algorithm, I believe there is a variant called "convex k-means" that could be applicable for cosine distance and sparse tf-idf vectors.
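One common workaround, sketched here with scikit-learn (an assumption on my part, not something this answer prescribes): L2-normalize the tf-idf vectors, after which ordinary Euclidean k-means behaves like cosine-based clustering, because for unit vectors ||u - v||^2 = 2 - 2*cos(u, v):

```python
# Cosine-like k-means via L2-normalized tf-idf vectors.
# TfidfVectorizer uses norm='l2' by default; it is spelled out here
# because the unit length of the rows is what makes this trick work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "teeth whiten 360 com teeth whiten treatment 18",
    "teeth whiten 360 com laser teeth whiten 22",
    "unwanted hair removal blogspot com unwanted hair removal product",
]

X = TfidfVectorizer(norm="l2").fit_transform(docs)  # unit-length sparse rows
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```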

Classic "hierarchical clustering" (apart from being outdated and not performing very well) is usually a problem due to the O(n^3) complexity of most algorithms and implementations. There are some specialized cases where an O(n^2) algorithm is known (SLINK, CLINK), but often the toolboxes only offer the naive cubic-time implementation (including GNU R, Matlab, SciPy, from what I just googled). And again, they will often have only a limited choice of distance functions available, probably not including cosine.

The methods are, however, often easy enough to implement yourself, optimized for your actual use case.

answered Jan 06 '23 17:01 by Has QUIT--Anony-Mousse