
Inferring templates from a collection of strings

I am indexing a set of websites which have a very large number of pages (tens of millions) that are generated from a small number of templates. I am looking for an algorithm to learn the templates that the pages were generated from and to match templates to pages so that I need to store only the variable part and a template reference for each page fetched.

The algorithm need not produce the greatest compression possible, but it should hopefully become better as it sees more pages and it should behave gracefully when faced with a page generated using a previously unseen template.

I would greatly appreciate any references to literature or existing libraries.

I could run a general-purpose compression algorithm on batches of pages. The reason I do not want to do that is that the data of interest to me lies in the variable parts of the pages, so the template approach would let me retrieve it without decompressing anything. I also want to be able to recreate the full page if needed, both to ensure future replicability and to guard against bugs in my scraping program.
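To make the requirement concrete, here is the kind of representation I have in mind -- a rough Python sketch of my own (using difflib, with a crude regex tokenizer standing in for a real HTML parser), not a finished algorithm: the template is whatever tokens all pages share, and everything else becomes a "hole" whose contents I would store per page.

    import difflib
    import re

    def tokenize(html):
        # Crude stand-in for a real HTML tokenizer: split into tags and text runs.
        return re.findall(r'<[^>]+>|[^<]+', html)

    def learn_template(pages):
        # The template is the token sequence shared by all pages;
        # anything that differs collapses into a hole (None).
        template = tokenize(pages[0])
        for page in pages[1:]:
            tokens = tokenize(page)
            matcher = difflib.SequenceMatcher(None, template, tokens, autojunk=False)
            merged = []
            for op, i1, i2, _, _ in matcher.get_opcodes():
                if op == 'equal':
                    merged.extend(template[i1:i2])
                else:
                    merged.append(None)  # variable slot
            # collapse runs of adjacent holes into a single hole
            template = [t for k, t in enumerate(merged)
                        if t is not None or k == 0 or merged[k - 1] is not None]
        return template

    def extract(template, page):
        # Return only the variable parts of the page, i.e. what I would store
        # alongside a reference to the template.
        fixed = [t for t in template if t is not None]
        tokens = tokenize(page)
        matcher = difflib.SequenceMatcher(None, fixed, tokens, autojunk=False)
        return [''.join(tokens[j1:j2])
                for op, i1, i2, j1, j2 in matcher.get_opcodes()
                if op != 'equal' and j1 < j2]

A real implementation would also need to record which holes were filled and handle template tokens missing from a page in order to reconstruct it exactly, but this captures the representation I am after.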

Asked Jun 07 '11 by Jyotirmoy Bhattacharya

1 Answer

In some circles, this problem is known as "HTML wrapper induction" or "wrapper learning". Here you can find an interesting -- albeit old -- review along with links to some commercial applications: http://www.xrce.xerox.com/Research-Development/Historical-projects/IWRAP-Intelligent-Wrapper-Learning-Tools

You may be interested in this Python library: http://code.google.com/p/templatemaker/ From the author's description (http://www.holovaty.com/writing/templatemaker/): "Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the 'template' that was used to create those files."
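If I remember the example from that blog post correctly, usage looks roughly like this -- treat the exact method names as an assumption to verify against the project page:

    from templatemaker import Template

    t = Template()
    t.learn('<b>this and that</b>')   # feed it sample pages one at a time
    t.learn('<b>alex and sue</b>')
    print(t.as_text('!'))             # '<b>! and !</b>'  (holes rendered as '!')
    print(t.extract('<b>larry and curly</b>'))  # ('larry', 'curly')

That extract() output is exactly the "variable part" you said you want to store per page.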

Another Python library, scrapy, also seems to have a wrapper induction component: http://dev.scrapy.org/wiki/Scrapy09Changes#Addedwrapperinductionlibrary

I can't say much about the algorithms themselves, though. If you want to implement one yourself, this looks like a good starting point: http://portal.acm.org/citation.cfm?id=1859138 It covers both wrapper induction and online learning, so you can start classifying pages as the crawl proceeds.
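For the online side specifically, the bookkeeping can be as simple as this rough sketch (mine, not the paper's algorithm): assign each incoming page to the most similar template seen so far, and open a new template when nothing matches well enough -- which also gives you the graceful behaviour on previously unseen templates that you asked for.

    import difflib

    class TemplateIndex:
        def __init__(self, threshold=0.7):
            self.representatives = []   # one sample page kept per template cluster
            self.threshold = threshold  # similarity cutoff; tune empirically

        def classify(self, page):
            # Return the index of the best-matching template, creating a new
            # cluster when the page looks like a previously unseen template.
            best_i, best_ratio = None, 0.0
            for i, rep in enumerate(self.representatives):
                ratio = difflib.SequenceMatcher(None, rep, page, autojunk=False).ratio()
                if ratio > best_ratio:
                    best_i, best_ratio = i, ratio
            if best_i is not None and best_ratio >= self.threshold:
                return best_i
            self.representatives.append(page)
            return len(self.representatives) - 1

Comparing raw page strings is slow at the scale of tens of millions of pages; in practice you would compare tag sequences, URL patterns, or hashes of the page skeleton and fall back to full comparison only when those are ambiguous.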

Answered Oct 19 '22 by Ruggiero Spearman