
How to categorize URLs using machine learning?

I'm indexing websites' content and I want to implement some categorization based solely on the URLs.

I would like to tell apart content view pages from navigation pages. By 'content view pages' I mean webpages where one can typically see the details of a product or a written article. By 'navigation pages' I mean pages that (typically) consist of lists of links to content pages or to other, more specific list pages.

Although some sites use a single site-wide key scheme to map all their content, most build their URLs bit by bit and scope their keys by section, so this should be possible.

In practice, what I want to do is take the list of URLs from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how. Machine learning appears to be a broad topic; what should I start reading about in particular? Which concepts, which algorithms, which tools?

Pico asked Nov 01 '12 10:11




1 Answer

If you want to discover these groups automatically, I suggest you find an implementation of a clustering algorithm. K-means is probably the most popular (you don't say what language you want to do this in). You already know there are two categories, so an algorithm that lets you specify the number of clusters a priori will make the problem easier.
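To give an idea of what such an algorithm does, here is a minimal sketch of k-means in plain Python. The function name `kmeans` and its parameters are made up for illustration, and in practice you would more likely use an existing library implementation. It assumes every page has already been turned into a list of numbers.

```python
import random

def kmeans(points, k=2, iterations=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = random.Random(seed)
    # Start from k input points picked at random (an assumed, simple init).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Calling `kmeans(vectors, k=2)` returns a list of 0/1 labels. Which label ends up meaning "content" and which "navigation" is arbitrary, so you still have to inspect a few members of each group.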

After that, define a bunch of features for your webpages, and run them through k-means to see what kind of groups are produced. Tweak the features you use until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page, rather than just the URLs.
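As a rough illustration of what "features" could mean when only the URL is available, here is a small sketch. The feature set and the function name `url_features` are assumptions chosen for the example, not a recommended or tested set; expect to tweak them for each site.

```python
from urllib.parse import urlparse, parse_qs

def url_features(url):
    """Turn one URL into a small numeric feature vector (illustrative only)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1] if segments else ""
    return [
        len(segments),                             # path depth
        len(parse_qs(parsed.query)),               # number of query parameters
        sum(ch.isdigit() for ch in parsed.path),   # digits in the path (ids, dates)
        len(last),                                 # length of the final segment
        last.count("-") + last.count("_"),         # word separators in the final segment
        1 if parsed.path.endswith("/") else 0,     # trailing slash
    ]

print(url_features("https://example.com/blog/2012/11/how-to-cluster-urls"))
# [4, 0, 6, 19, 3, 0]
print(url_features("https://example.com/category/shoes/?page=2&sort=price"))
# [2, 2, 0, 5, 0, 1]
```

Each URL becomes one vector, and the list of vectors is what you hand to k-means. Because the features have very different ranges, scaling them to a comparable range first is usually worthwhile.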

Ben Allison answered Oct 06 '22 08:10