
Web Crawling Sites with JavaScript or Web Forms

I have a web crawler application. It successfully crawls most common, simple sites. Now I have run into websites whose HTML documents are generated dynamically through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not show the actual HTML page: if I browse the page in IE or Firefox, the HTML source does not match what the browser actually displays. These sites contain textboxes, checkboxes, etc., so I believe they are what is called "Web Forms". I am not very familiar with web development, so correct me if I'm wrong.

My question is: has anyone been in a similar situation and successfully solved these kinds of "challenges"? Does anyone know of a book or article on web crawling, particularly one that covers these more advanced types of websites?

Thanks.

Jojo asked Nov 05 '22 14:11


2 Answers

There are two separate issues here.

Forms

As a rule of thumb, crawlers do not touch forms.

It might be appropriate to write something for a specific website that submits predetermined (or semi-random) data, particularly when writing automated tests for your own web applications, but generic crawlers should leave forms well alone.

The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13. There may be a library for C# that will help.
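To make the form case concrete, here is a rough sketch of what "submitting predetermined data" to one specific form could look like. It is written in Python for brevity (the same steps apply in C#), and the names `FormFieldParser`, `build_submission`, the sample form, and the example.com URL are all made up for illustration. It only handles `<input>` controls and the default `application/x-www-form-urlencoded` encoding, so `<select>`, `<textarea>`, and file uploads are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collects the action, method, and submittable <input> controls of the first form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = []  # (name, value) pairs in document order
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        a = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "input" and self._in_form:
            name = a.get("name")
            if not name or "disabled" in a:
                return  # unnamed or disabled controls are not submitted
            kind = (a.get("type") or "text").lower()
            if kind in ("checkbox", "radio"):
                if "checked" in a:  # only checked boxes are sent
                    self.fields.append((name, a.get("value") or "on"))
            elif kind not in ("submit", "button", "reset", "image", "file"):
                self.fields.append((name, a.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(page_url, html, overrides=None):
    """Return (method, url, body) for the first form, with chosen values filled in."""
    parser = FormFieldParser()
    parser.feed(html)
    overrides = overrides or {}
    fields = [(name, overrides.get(name, value)) for name, value in parser.fields]
    encoded = urlencode(fields)
    target = urljoin(page_url, parser.action)
    if parser.method == "post":
        return "POST", target, encoded
    # For GET, the encoded data set is appended to the action URL as a query string.
    return "GET", target + ("?" + encoded if encoded else ""), None


SAMPLE = """
<form action="/search" method="get">
  <input type="text" name="q" value="">
  <input type="checkbox" name="exact" value="1" checked>
  <input type="checkbox" name="archived" value="1">
  <input type="submit" value="Go">
</form>
"""

print(build_submission("http://example.com/index.html", SAMPLE, {"q": "web crawler"}))
```

The result is just a method, a URL, and an optional body, which an existing crawler can fetch like any other page. Choosing sensible values for `overrides` is the part that has to be decided per site.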

JavaScript

JavaScript is a rather complicated beast.

There are three common ways you can deal with it:

  1. Write your crawler so it duplicates the JS functionality of specific websites that you care about.
  2. Automate a web browser.
  3. Use something like Rhino with env.js.
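As an illustration of the first option only, here is a minimal Python sketch for a hypothetical site whose pagination links call a `showPage(n)` function that sets `window.location` to a fixed prefix plus the page number. The function name, the URL pattern, and the sample page are assumptions invented for this example; a real site needs its own pattern, worked out by reading its scripts. The other two options need a real browser or a JavaScript engine and are not shown.

```python
import re
from urllib.parse import urljoin

# Matches the hypothetical site's navigation code, e.g.:
#   window.location = '/items?page=' + n;
NAV_RE = re.compile(r"""window\.location\s*=\s*["']([^"']+)["']\s*\+\s*n\b""")


def paginated_urls(page_url, html, pages):
    """Rebuild the URLs that showPage(n) would navigate to, without running any JS."""
    match = NAV_RE.search(html)
    if match is None:
        return []  # the page does not use the pattern we know how to replicate
    prefix = urljoin(page_url, match.group(1))
    return [prefix + str(n) for n in pages]


SAMPLE = """
<html><body>
<a href="#" onclick="showPage(2)">next</a>
<script type="text/javascript">
  function showPage(n) { window.location = '/items?page=' + n; }
</script>
</body></html>
"""

for url in paginated_urls("http://example.com/list.html", SAMPLE, range(1, 4)):
    print(url)
```

This is cheap to run but brittle: it breaks as soon as the site changes its script, which is why it only makes sense for a handful of sites you specifically care about.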
Quentin answered Nov 12 '22 19:11


I found an article that tackles the deep web. It's very interesting, and I think it answers my questions above.

http://www.trycatchfail.com/2008/11/10/creating-a-deep-web-crawler-with-net-background/

Gotta love this.

Jojo answered Nov 12 '22 19:11