Given n raw URLs, I'd like to classify each one as news, blog, photo, or video.
For example, if a link points to a photo, is it enough that the raw URL contains an image file extension to classify it as a photo?
As for video, blog, and news, a set of known domains (like http://www.youtube.com) doesn't seem to be enough to classify the raw URLs.
Could classification be done by examining the web content? Or are there any open source tools for this?
URL classification is based on real users actively visiting URLs, as opposed to classifying bot traffic. It relies on crowd-sourcing to obtain a constant stream of URLs to analyze.
Content Classification analyzes a document and returns a list of content categories that apply to the text found in the document. To classify the content in a document, call the classifyText method.
Web page classification is the process of assigning a Web page to one or more predefined categories. It plays a vital role in focused crawling, assisted development of Web directories, topic-specific Web link analysis, contextual advertising, and analysis of the Web's topical structure.
The only URLs that can be classified even somewhat reliably are those that point to a distinct medium (e.g. http://foo.com/foo.jpg is almost certainly an image). Otherwise, you must analyze the content of the page.
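To illustrate the extension check, here is a minimal sketch in Python. The function name `classify_by_extension` and the two extension sets are my own assumptions, not an exhaustive or authoritative list, and a matching extension is only a hint: a server can return anything at a URL ending in `.jpg`.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Illustrative extension sets (assumed, not exhaustive).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".mkv", ".flv"}


def classify_by_extension(url):
    """Return "photo", "video", or None based only on the URL path's extension."""
    # urlparse separates the path from the query string and fragment,
    # so "clip.mp4?x=1" is still seen as ".mp4".
    path = unquote(urlparse(url).path)
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "photo"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None  # no usable hint in the URL itself


if __name__ == "__main__":
    for url in ("http://foo.com/foo.jpg", "http://www.youtube.com"):
        print(url, "->", classify_by_extension(url))
```

Anything that comes back as `None` falls into the "analyze the content" case.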
This can be a bit tricky, as Flash may contain a photo, a video, or neither, without providing any searchable clue as to the content of the Flash object. With enough effort, this can obviously be overcome (Google does it!), but I'm not aware of any open source resources that provide a library of media-related domains. Such data result from countless programmer-hours of effort, and that effort typically seeks a return on investment (ROI). Case in point: ClueWeb09 is just a dataset of downloaded pages used to test search algorithms. It is not really sorted or categorized.
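For pages that are plain HTML rather than Flash, a rough sketch of "analyzing the content" could look like the following. It assumes the page exposes an Open Graph `og:type` meta tag or an HTML `<video>` element, which many pages do not, and the names `PageSignals` and `classify_html` are hypothetical. It also cannot tell news from blog; both come back as `"article"`.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collects a few simple hints from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og_type = None      # value of <meta property="og:type">, if present
        self.video_tags = 0      # number of <video> elements seen

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:type":
            self.og_type = (attributes.get("content") or "").lower()
        elif tag == "video":
            self.video_tags += 1


def classify_html(html):
    """Return "video", "article", or "unknown" from already-downloaded HTML."""
    signals = PageSignals()
    signals.feed(html)
    if signals.og_type:
        # Open Graph video types look like "video.movie" or "video.other".
        if signals.og_type.startswith("video"):
            return "video"
        if signals.og_type == "article":
            return "article"
    if signals.video_tags > 0:
        return "video"
    return "unknown"


if __name__ == "__main__":
    sample = '<html><head><meta property="og:type" content="article"></head></html>'
    print(classify_html(sample))
```

Telling news apart from blogs would need something beyond these hints, such as a text classifier trained on labeled pages.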