I am working on an app for doing screen scraping of small portions of external web pages (not an entire page, just a small subset of it).
The code for scraping the HTML works perfectly. My problem is that I want to scrape not just the raw HTML, but also the CSS styles used to format the section of the page I am extracting, so I can display it on a new page with its original formatting intact.
If you are familiar with Firebug, it can display which CSS styles apply to the specific subset of the page you have highlighted. If I could figure out a way to do that, I could just use those styles when displaying the content on my new page. But I have no idea how to do this.
Today I needed to scrape Facebook share dialogs to use as dynamic preview samples in our app builder for Facebook apps. I took the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I copied their getElementHTML function from lib.js and modified it so that the copied markup carries its styles inline.
It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). But it is definitely usable when operated by a web developer who can debug and fix all quirks.
Problems I've found so far:
Anyway, this solution saved me a lot of time. Originally I was picking out pieces of their stylesheets by hand and postprocessing them manually. It was slow, boring, and polluted our class namespace. Now I'm able to scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.
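To make the "inlined styles" idea concrete, here is a minimal sketch in Python. It is not the Firebug patch described above: it assumes you already have the extracted elements as plain dicts and the stylesheet as a list of (selector, declarations) pairs, and the names `matches` and `inline_html` are made up for illustration. It only understands tag, `#id` and `.class` selectors, and it lets later rules win without computing real CSS specificity.

```python
from html import escape


def matches(selector, element):
    """Very small selector matcher: tag, #id or .class only (an assumption)."""
    if selector.startswith("#"):
        return element.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in element.get("classes", ())
    return element["tag"] == selector


def inline_html(element, rules):
    """Serialize an element tree, writing matching declarations into style=""."""
    if isinstance(element, str):  # text node
        return escape(element)
    merged = {}
    for selector, declarations in rules:
        if matches(selector, element):
            merged.update(declarations)  # later rules override earlier ones
    style = "; ".join(f"{name}: {value}" for name, value in merged.items())
    attr = f' style="{escape(style, quote=True)}"' if style else ""
    inner = "".join(inline_html(child, rules) for child in element.get("children", ()))
    # ids and classes are deliberately dropped, so the output cannot
    # collide with class names on the page that embeds it
    return f"<{element['tag']}{attr}>{inner}</{element['tag']}>"


rules = [
    ("div", {"border": "1px solid #ccc"}),
    (".title", {"font-weight": "bold", "color": "black"}),
    ("#share", {"color": "#333"}),
]
dialog = {
    "tag": "div",
    "id": "share",
    "children": [{"tag": "span", "classes": ["title"], "children": ["Share & enjoy"]}],
}
print(inline_html(dialog, rules))
```

Because every style ends up in a `style` attribute, the exported fragment needs no stylesheet of its own, which is what keeps it from interfering with the host page.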
A good start would be the following: make a pass through the patch of HTML you plan to extract, collecting each element (along with its ID, classes and inline styles) into an array. Grab the styles for those element IDs and classes from the page's stylesheets right away.
Then, from the outermost element(s) in the target patch, work your way up through the rest of the elements in the DOM in a similar fashion, eventually all the way up to the body and HTML elements, comparing against your initial array and collecting any styles that weren't declared within the target patch or its applied styles.
You'll also want to check for any * declarations and grab those as well. Then, make sure when you're reapplying the styles to your eventual output you do so in the right order, as you collected them from low-to-high in the DOM hierarchy and they'll need to be reapplied high-to-low.
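The steps above could be sketched roughly like this in Python. This is only an illustration under some assumptions: `Node` is a hypothetical stand-in for a parsed DOM element, the stylesheet is assumed to be already parsed into (selector, declarations) pairs, and only `*`, tag, `#id` and `.class` selectors are handled. Real selectors (descendant combinators, pseudo-classes) and specificity would need a proper CSS library.

```python
class Node:
    """A pared-down DOM element: tag, id, classes, children and a parent link."""

    def __init__(self, tag, id="", classes=(), children=()):
        self.tag = tag
        self.id = id
        self.classes = tuple(classes)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def matches(selector, node):
    """Match a simple selector (tag, #id or .class) against one node."""
    if selector.startswith("#"):
        return node.id == selector[1:]
    if selector.startswith("."):
        return selector[1:] in node.classes
    return node.tag == selector


def walk(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def collect_rules(target, rules):
    """Collect rules low-to-high in the DOM, return them ordered high-to-low."""
    universal = [rule for rule in rules if rule[0] == "*"]
    seen = set()

    def grab(nodes):
        batch = []
        for index, (selector, declarations) in enumerate(rules):
            if selector == "*" or index in seen:
                continue
            if any(matches(selector, node) for node in nodes):
                seen.add(index)
                batch.append((selector, declarations))
        return batch

    levels = [grab(list(walk(target)))]  # the target patch itself
    ancestor = target.parent
    while ancestor is not None:          # then each ancestor up to <html>
        levels.append(grab([ancestor]))
        ancestor = ancestor.parent

    ordered = list(universal)            # "*" rules go first
    for level in reversed(levels):       # reapply from <html> down to the patch
        ordered.extend(level)
    return ordered


patch = Node("div", id="main", children=[Node("p", classes=["note"])])
page = Node("html", children=[Node("body", children=[patch, Node("div", id="sidebar")])])
rules = [
    ("*", {"margin": "0"}),
    ("body", {"font-family": "serif"}),
    ("#sidebar", {"float": "left"}),
    (".note", {"color": "red"}),
    ("html", {"background": "white"}),
    ("div", {"padding": "4px"}),
]
print([selector for selector, _ in collect_rules(patch, rules)])
# ['*', 'html', 'body', '.note', 'div']
```

Note that the `#sidebar` rule is left out because nothing in the patch or its ancestor chain matches it, which is the point of collecting only what the patch actually uses.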