Crawling Websites That Use JavaScript or Web Forms

Question

I have a web crawler application. It successfully crawls most common, simple sites. Now I have run into websites whose HTML documents are generated dynamically through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not expose the actual HTML of the page: if I open the page in IE or Firefox, the HTML source does not match what the browser actually displays. These sites contain textboxes, checkboxes, etc., so I believe they are what people call "Web Forms". I am not very familiar with web development, so correct me if I'm wrong.

My question is: has anyone been in a similar situation and successfully solved these kinds of "challenges"? Does anyone know of a book or article about web crawling that covers these more advanced types of websites?

Thanks.

Quentin · Accepted Answer

There are two separate issues here.

Forms

As a rule of thumb, crawlers do not touch forms.

It might be appropriate to write something for a specific website that submits predetermined (or semi-random) data (particularly when writing automated tests for your own web applications), but generic crawlers should leave forms well alone.

The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13. There may be a library for C# that will help.
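As a rough illustration of what that section of the spec describes, here is a minimal sketch of building a form submission by hand. It is written in Python for brevity; the same steps apply in C#. The names `FormFieldParser` and `build_form_request` are made up for this example, it only looks at `<input>` elements of the first form on the page, and it assumes the default `application/x-www-form-urlencoded` encoding. It builds the request without sending it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormFieldParser(HTMLParser):
    """Collects action, method and <input> defaults of the first form (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            kind = (attrs.get("type") or "text").lower()
            if kind in ("submit", "button", "reset", "image", "file"):
                return  # buttons and file uploads are out of scope for this sketch
            if kind in ("checkbox", "radio"):
                if "checked" not in attrs:
                    return  # unchecked controls are not submitted
                default = "on"  # what browsers send when no value is given
            else:
                default = ""
            value = attrs.get("value")
            self.fields[attrs["name"]] = default if value is None else value

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_request(page_url, html, overrides):
    """Build (but do not send) a request that submits the first form in `html`."""
    parser = FormFieldParser()
    parser.feed(html)
    data = {**parser.fields, **overrides}  # predetermined values win over defaults
    body = urlencode(data)  # name=value pairs joined by &, spaces become +
    target = urljoin(page_url, parser.action)
    if parser.method == "post":
        return Request(
            target,
            data=body.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
    # GET: the form data goes into the query string
    # (assumes the action URL has no query string of its own).
    return Request(target + "?" + body)
```

The returned object can be passed to `urllib.request.urlopen` to actually submit the form. Real forms also contain `<select>` and `<textarea>` controls, which this sketch ignores.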

JavaScript

JavaScript is a rather complicated beast.

There are three common ways you can deal with it:

Write your crawler so it duplicates the JS functionality of specific websites that you care about.
Automate a web browser.
Use something like Rhino with env.js.
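To make the first option concrete, here is a minimal sketch of duplicating a site's JS behaviour, again in Python for brevity. It assumes a purely hypothetical site whose script loads its listing from a JSON endpoint `/api/items?page=N` that returns `{"items": [{"url": ...}], "next_page": N or null}`. The endpoint, the field names, and the functions `parse_listing` and `crawl_listing` are all invented for illustration; for a real site you would have to read its scripts (or watch its network traffic) to find out what it actually requests.

```python
import json
from urllib.parse import urljoin


def parse_listing(json_text, base_url):
    """Return (absolute links, next page number or None) from one JSON response."""
    payload = json.loads(json_text)
    links = [urljoin(base_url, item["url"]) for item in payload.get("items", [])]
    return links, payload.get("next_page")


def crawl_listing(fetch, base_url, endpoint="/api/items?page={page}", max_pages=10):
    """Follow the hypothetical paginated endpoint the way the site's script would.

    `fetch` is any callable that takes a URL and returns the response body as text.
    """
    found = []
    page = 1
    for _ in range(max_pages):  # hard cap so a misbehaving site cannot loop forever
        if page is None:
            break
        text = fetch(urljoin(base_url, endpoint.format(page=page)))
        links, page = parse_listing(text, base_url)
        found.extend(links)
    return found
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, and it lets you try the parsing logic against saved responses before pointing it at a live site.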

Jojo · Answer

I found an article that tackles the deep web. It's very interesting, and I think it answers my questions above.

http://www.trycatchfail.com/2008/11/10/creating-a-deep-web-crawler-with-net-background/

Gotta love this.
