I have a web crawler application. It has successfully crawled most common, simple sites. Now I have run into some types of websites where the HTML documents are generated dynamically through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not show the actual HTML page: if I browse to the page in IE or Firefox, the HTML source does not match what is actually displayed in the browser. These sites contain text boxes, checkboxes, etc., so I believe they are what are called "Web Forms". I am not very familiar with web development, so correct me if I'm wrong.
My question is: has anyone been in a similar situation and successfully solved these kinds of "challenges"? Does anyone know of a book or article about web crawling, particularly one that covers these more advanced types of websites?
Thanks.
There are two separate issues here.
As a rule of thumb, crawlers do not touch forms.
It might be appropriate to write something for a specific website that submits predetermined (or semi-random) data (particularly when writing automated tests for your own web applications), but generic crawlers should leave them well alone.
The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13, and there may be a library for C# that will help.
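To make the form submission step concrete, here is a minimal sketch in Python (not C#, and only the standard library) of how a form data set is typically encoded as `application/x-www-form-urlencoded`. The helper name `build_form_request`, the URL `http://example.com/search`, and the field names are all made up for illustration; a real crawler would read the action URL, method, and field names from the `<form>` element of the specific site it targets. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields, method="post"):
    """Build (but do not send) a request that submits `fields` to a form.

    `action_url`, `fields`, and `method` stand in for the form's `action`
    attribute, its name/value pairs, and its `method` attribute.
    """
    # Encode the name/value pairs as application/x-www-form-urlencoded.
    encoded = urlencode(fields)

    if method.lower() == "get":
        # For GET, the encoded data is appended to the action URL as a query string.
        separator = "&" if "?" in action_url else "?"
        return Request(action_url + separator + encoded, method="GET")

    # For POST, the encoded data is sent as the request body.
    return Request(
        action_url,
        data=encoded.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical form: a search box named "q" and a page selector named "page".
post_request = build_form_request(
    "http://example.com/search", {"q": "web crawler", "page": "2"}
)
get_request = build_form_request(
    "http://example.com/search", {"q": "web crawler", "page": "2"}, method="get"
)
```

Passing either request to `urllib.request.urlopen` would actually submit it, which, per the advice above, is only appropriate for a specific site you have deliberately chosen to target.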
JavaScript is a rather complicated beast.
There are three common ways you can deal with it:
I found an article that tackles the deep web. It's very interesting, and I think it answers my questions above.
http://www.trycatchfail.com/2008/11/10/creating-a-deep-web-crawler-with-net-background/
Gotta love this.