
Content Classification from URL [closed]

Given n raw URLs, I'd like to classify each one as news, blog, photo, or video.

For example, if a link points directly to a photo, is it enough to check that the raw URL contains an image file extension in order to classify it as photo?

As for video, blog, and news, it seems that a set of known domains (like http://www.youtube.com) isn't enough to classify the raw URLs.

Could classification be done by examining the web content? Or are there any open source tools for this?

eunique0216 asked Feb 17 '11 03:02



1 Answer

The only URLs that can be classified even somewhat reliably are those that point to a distinct medium (e.g. http://foo.com/foo.jpg is almost certainly an image). Otherwise, you must analyze the content of the page.
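As a rough illustration of the extension check, here is a minimal Python sketch. The function name `classify_by_extension` and the category labels are my own placeholders, and it relies on the standard library's `mimetypes` table rather than any curated list, so treat it as a starting point only. It returns `None` whenever the URL path gives no clue, which is the case where you would have to look at the page content instead.

```python
import mimetypes
from urllib.parse import urlparse


def classify_by_extension(url):
    """Return 'photo', 'video', or None when the URL path gives no clue.

    Hypothetical helper: only the path is inspected, so query strings
    and fragments are ignored.
    """
    path = urlparse(url).path
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        return None  # no recognizable extension; content analysis needed
    if mime_type.startswith("image/"):
        return "photo"
    if mime_type.startswith("video/"):
        return "video"
    return None  # e.g. text/html: could be news, blog, or anything else


print(classify_by_extension("http://foo.com/foo.jpg"))
print(classify_by_extension("http://www.youtube.com/watch?v=abc"))
```

Keep in mind that an extension is only a hint: a server is free to return anything for a URL ending in .jpg.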

This can be a bit tricky, as a Flash object may contain a photo, a video, or neither, without providing any searchable clue as to its content. With enough effort, this can obviously be overcome (Google does it!), but I'm not aware of any open source resources that provide a library of media-related domains. Such data results from countless programmer-hours of effort, an effort that typically seeks a return on investment (ROI). Case in point: ClueWeb09 is just a dataset of downloaded pages, used to test search algorithms, and is not really sorted or categorized.
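To give an idea of what a first pass at content analysis might look like, the sketch below counts media-related tags in a page's HTML using Python's built-in `html.parser`. The class name `MediaTagCounter`, the helper `count_media_tags`, and the choice of tags are my own assumptions, not an established method. Note that it illustrates the Flash problem directly: an `object` or `embed` tag tells you something is embedded, but not whether it is a photo, a video, or neither.

```python
from html.parser import HTMLParser


class MediaTagCounter(HTMLParser):
    """Count start tags that hint at media content (hypothetical heuristic)."""

    def __init__(self):
        super().__init__()
        self.counts = {"img": 0, "video": 0, "object": 0, "embed": 0}

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags are routed here too.
        if tag in self.counts:
            self.counts[tag] += 1


def count_media_tags(html_text):
    """Return a dict of media-related tag counts for the given HTML string."""
    parser = MediaTagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


sample = '<p><img src="a.jpg"><object data="movie.swf"></object></p>'
print(count_media_tags(sample))
```

Turning such counts into a news/blog/photo/video label would still need thresholds or a trained model, which is where most of the real effort goes.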

"Sometimes no help is the answer."

Mike Christian answered Sep 27 '22 18:09