Super-fast screen scraping techniques? [closed]

I often find myself needing to do some simple screen scraping for internal purposes (e.g., a third-party service I use only publishes its reports as HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen-scraping code, but it takes a while. Here is my usual process:

  1. Open Charles Proxy against the web site and see what's going on.
  2. Start writing some Java code using Apache HttpClient, dealing with cookies and multiple requests.
  3. Use Jericho HTML Parser to handle parsing of the HTML.
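The three steps above can be sketched roughly as follows. This is a minimal stand-in, not the exact stack from the question: it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) in place of Apache HttpClient, and a regex in place of Jericho's element API; the report URL and the table-cell extraction are illustrative assumptions.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReportScraper {

    // Step 2: a client that carries session cookies across multiple requests.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())          // keeps cookies between calls
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Step 3: pull the text of each <td> cell out of the report HTML.
    // (A regex stand-in for what Jericho does more robustly.)
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.DOTALL).matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = buildClient();
        // Hypothetical report URL -- substitute whatever Charles shows you in step 1.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.com/reports/monthly.html")).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(extractCells(response.body()));
    }
}
```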

I wish I could just "record my session" quickly and then parametrize the things that vary from session to session. Imagine using Charles to capture all the HTTP requests and then parametrizing the relevant query-string or POST params. Voilà: a reusable HTTP script.
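A poor man's version of that "record, then parametrize" idea is just the captured request with placeholders in it: paste the URL Charles shows you, mark the parts that vary, and fill them in per run. A sketch of the substitution step; the parameter names `reportId` and `date` are made up for illustration, not from any real service.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class RequestTemplate {

    // Replace each ${name} marker in a captured URL (or POST body)
    // with the URL-encoded value supplied for this run.
    static String fill(String template, Map<String, String> params) {
        String result = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            String encoded = URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8);
            result = result.replace("${" + e.getKey() + "}", encoded);
        }
        return result;
    }

    public static void main(String[] args) {
        // A "recorded" request with the varying parts marked up.
        String captured = "https://example.com/report?reportId=${reportId}&date=${date}";
        System.out.println(fill(captured, Map.of("reportId", "42", "date", "2009-02-26")));
    }
}
```

From there, each scraping job is just the same template with a different parameter map, fed to whatever HTTP client you already use.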

Is there anything that does this already? When I worked at a big company, we used a tool called LoadRunner by Mercury Interactive that had a nice way to record an HTTP session and make it reusable (for testing purposes). That tool, unfortunately, is very expensive.

Ish, asked Feb 26 '09



2 Answers

HtmlUnit is a scriptable, headless browser written in Java. We use it on some extremely complex, error-prone web pages and it usually does a very good job.

To simplify things even more, you can drive it from Jython. The resulting program reads more like a transcript of how one might use a browser than like hard work.

toothygoose, answered Sep 23 '22


You don't mention what you want to use this for. If having a web browser repeat your actions is acceptable, one solution is simply to "script" your browser with a tool like Selenium. You can use the Selenium IDE to record what you do and then alter the parameters.

Mark Fowler, answered Sep 24 '22