
Screen scraping pages that use CSS for layout and formatting...how to scrape the CSS applicable to the html?

I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).

I have the code working perfectly for scraping the HTML, but my problem is that I want to scrape not just the raw HTML, but also the CSS styles used to format the section of the page I am extracting, so I can display it on a new page with its original formatting intact.

If you are familiar with Firebug, it can display which CSS styles apply to the specific subset of the page you have highlighted. If I could figure out a way to do that, I could just use those styles when displaying the content on my new page. But I have no idea how to do this.

tbone asked Nov 18 '08 17:11

2 Answers

Today I needed to scrape Facebook share dialogs to be used as dynamic preview samples in our app builder for Facebook apps. I took the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I copied their getElementHTML function from lib.js and modified it to do this:

  • remove class, id and style attributes
  • remove onclick and similar JavaScript handlers
  • remove all data-something attributes
  • remove explicit hrefs and replace them with "#"
  • replace all block-level elements with div and inline elements with span (to prevent inheriting styles on the target page)
  • absolutize relative URLs
  • inline all applied non-default CSS attributes into a brand-new style attribute
  • reduce inline style bloat by accounting for parent/child style inheritance while traversing up the DOM tree
  • indent the output
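The inheritance-reduction step in that list can be sketched roughly as follows. This is a minimal illustration, not the actual Firebug code: it assumes the computed styles have already been captured as property→value dicts, and the function name is hypothetical.

```python
def inline_styles(element_style, parent_style):
    """Keep only the declarations that differ from the parent's
    computed style, so inherited values are not repeated inline."""
    return {
        prop: value
        for prop, value in element_style.items()
        if parent_style.get(prop) != value
    }

parent = {"color": "rgb(0, 0, 0)", "font-size": "13px", "display": "block"}
child = {"color": "rgb(0, 0, 0)", "font-size": "13px", "display": "inline"}

# Only 'display' differs from the parent, so only it gets inlined.
print(inline_styles(child, parent))
```

In a browser you would obtain the full property→value map via getComputedStyle on each element and its parent before applying this reduction.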

It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). It is definitely usable, though, when operated by a web developer who can debug and fix the quirks.

Problems I have found so far:

  • sometimes the clear CSS property is not emitted (this breaks layout pretty badly)
  • :hover and other pseudo-classes cannot be captured this way
  • Firefox keeps only Mozilla-specific CSS properties/values in its model, so you lose, for example, -webkit-border-radius, because it was skipped by the CSS parser

Anyway, this solution saved me a lot of time. Originally I was manually selecting pieces of their stylesheets and doing manual selection and post-processing. It was slow, boring, and polluted our class namespace. Now I can scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.

Antonin Hildebrand answered Nov 08 '22 03:11

A good start would be the following: make a pass through the patch of HTML you plan to extract, collecting each element (and its ID, classes, and inline styles) into an array. Grab the styles for those element IDs and classes from the page's stylesheets right away.

Then, from the outermost element(s) in the target patch, work your way up through the rest of the elements in the DOM in a similar fashion, eventually all the way up to the body and HTML elements, comparing against your initial array and collecting any styles that weren't declared within the target patch or its applied styles.

You'll also want to check for any * declarations and grab those as well. Then, when reapplying the styles to your eventual output, make sure you do so in the right order: you collected them from low to high in the DOM hierarchy, and they'll need to be reapplied high to low so that the more specific (deeper) declarations win.
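The reapplication-order point above can be sketched with a small merge helper. This is an illustrative sketch under the assumption that each ancestor's applicable declarations have already been collected as a property→value dict, ordered from the target patch (low) up to the html element (high); the function name is hypothetical.

```python
def merge_collected(styles_low_to_high):
    """Merge declarations collected low-to-high in the DOM hierarchy.
    Apply them high-to-low so declarations closer to the target
    element override those inherited from further up the tree."""
    merged = {}
    for declarations in reversed(styles_low_to_high):  # outermost first
        merged.update(declarations)                    # inner overrides outer
    return merged

collected = [
    {"color": "red"},                     # on the target element itself
    {"color": "blue", "margin": "0"},     # from an ancestor (e.g. body)
]
# The target element's own 'color' wins; 'margin' is kept from the ancestor.
print(merge_collected(collected))
```

A real implementation would also need to respect selector specificity within each level, not just DOM depth, before merging.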

Evan answered Nov 08 '22 03:11