I've been trying to find a good how-to, or a beginner-friendly example, for writing a first web crawler. I would like to write it in C#. Does anybody have good example code to share, or tips on sites where I can find information about C# and basic web crawling?
Thanks
HtmlAgilityPack is your friend.
Yes, HtmlAgilityPack is a good tool for parsing HTML, but it is definitely not enough on its own.
There are 3 elements to crawling:
1) Crawling itself, i.e. looping through websites: This could be done by sending requests to random IP addresses, but that does not work well. Many websites share an IP address and are told apart only by the HTTP Host header, so a request to the bare IP address does not reach them. In addition, far too many IP addresses are unused or do not host a web server, so this approach does not get you anywhere.
I suggest you send requests to Google (searching for words from a dictionary) and crawl the results that come back.
2) Rendering the content: Many websites generate their HTML content with JavaScript when the page loads, so a simple HTTP request will not capture the content a user would actually see. You need to render the page as a browser does. That can be done with WebKit.NET, an open source tool, although it is still in beta.
3) Comprehending and parsing the HTML: use HtmlAgilityPack; there are tons of examples online. It can also be used to crawl the site itself, by extracting the links to follow.
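To make elements 1 and 3 concrete, here is a minimal sketch of a crawl loop in C#. It is only an illustration under a few assumptions: the names `MiniCrawler`, `ExtractLinks` and `CrawlAsync` are made up for this example, it starts from a single seed URL rather than from search results, it stays on the seed's host, and it pulls links out with a naive regular expression so that it needs nothing beyond the base class library. In a real crawler you would probably swap the regex for HtmlAgilityPack, and this sketch does no JavaScript rendering at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class MiniCrawler
{
    // Naive matcher for quoted href values inside <a> tags. This is an
    // assumption made to keep the sketch dependency-free; a real HTML parser
    // copes with malformed markup far better than a regex does.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\s[^>]*?href\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the absolute http/https links found in the page, with any
    // #fragment removed so the same page is not queued under several names.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Resolve relative links such as "/about" against the page URL.
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out Uri absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp
                    || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(new Uri(absolute.GetLeftPart(UriPartial.Query)));
            }
        }
        return links;
    }

    // Breadth-first crawl starting at seed. The download step is passed in
    // as a delegate (hypothetical design choice) so it can be an HttpClient
    // call in production or an in-memory lookup while experimenting.
    public static async Task<List<Uri>> CrawlAsync(
        Uri seed, int maxPages, Func<Uri, Task<string>> fetch)
    {
        var visited = new HashSet<Uri>();
        var crawled = new List<Uri>();
        var frontier = new Queue<Uri>();
        frontier.Enqueue(seed);

        while (frontier.Count > 0 && crawled.Count < maxPages)
        {
            Uri current = frontier.Dequeue();
            if (!visited.Add(current)) continue; // already handled

            string html;
            try { html = await fetch(current); }
            catch (Exception) { continue; } // skip pages that fail to download

            crawled.Add(current);
            foreach (Uri link in ExtractLinks(html, current))
            {
                // Stay on the seed's host so the crawl remains bounded.
                if (link.Host == seed.Host && !visited.Contains(link))
                    frontier.Enqueue(link);
            }
        }
        return crawled;
    }
}
```

To run it against a live site you could pass `HttpClient.GetStringAsync` as the fetch step, for example `await MiniCrawler.CrawlAsync(new Uri("https://example.com/"), 20, u => http.GetStringAsync(u))` where `http` is a shared `HttpClient`. A polite crawler would also add a delay between requests and honour robots.txt, both of which are left out here.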