
Scrape a dynamic website

What is the best method to scrape a dynamic website where most of the content appears to be generated by Ajax requests? I have previous experience with a combination of Mechanize, BeautifulSoup, and Python, but I am up for something new.

--Edit-- For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.

Colin Barnes asked Oct 15 '08 23:10



1 Answer

This is a difficult problem because you have to either reverse engineer the JavaScript on a per-site basis (work out which requests the page makes and call them yourself), or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
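As a rough sketch of the first option: once you have spotted the request a page makes (for example in the browser's network inspector), you can often call that endpoint directly and skip the HTML entirely. Everything specific here is an assumption made for illustration: the payload shape (a "results" list with "name" and "votes" fields), the helper names fetch_json and extract_rows, the UTF-8 encoding, and the X-Requested-With header, which some sites check and many ignore. It is not a description of how CNN's pages actually work.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Request a URL the way the page's own script might, and decode the JSON body."""
    # Some endpoints only answer requests that look like Ajax calls.
    # This header is a common convention, not a guarantee.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=timeout) as response:
        # Assumes the endpoint returns UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


def extract_rows(payload):
    """Flatten the hypothetical payload into (name, votes) tuples."""
    return [(row["name"], row["votes"]) for row in payload.get("results", [])]


# Hypothetical payload, standing in for what fetch_json(url) might return.
sample = json.loads(
    '{"results": [{"name": "A", "votes": 10}, {"name": "B", "votes": 7}]}'
)
rows = extract_rows(sample)
```

In real use you would pass the endpoint URL you discovered to fetch_json and adapt extract_rows to whatever structure comes back. The per-site cost mentioned above shows up here: each site needs its own URL and its own parsing.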

It's a heavyweight solution, but I've seen people run the scripts in a real browser using Greasemonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the resulting elements. You can even initiate user actions on the page if needed.

-Adam

Adam Davis answered Sep 27 '22 20:09
