If I have a forums site with a large number of threads, will the search engine bot crawl the whole site every time? Say I have over 1,000,000 threads on my site, will they all get crawled every time the bot crawls my site? Or how does it work? I want my website to be indexed, but I don't want the bot to kill my website! In other words, I don't want the bot to keep crawling the old threads again and again every time it crawls my website.
Also, what about pages that were crawled before? Will the bot request them every time it crawls my website to make sure they are still on the site? I ask because I only link to the latest threads: there's a page that lists all the latest threads, but I don't link to the older ones, so they have to be requested explicitly by URL, e.g. http://example.com/showthread.aspx?threadid=7. Will this stop the bot from bringing my site down and consuming all my bandwidth?
P.S. The site is still under development but I want to know in order to design the site so that search engine bots don't bring it down.
Generally, Googlebot crawls over HTTP/1.1. However, Googlebot may crawl sites over HTTP/2 if the site supports it and may benefit from it. This can save computing resources (for example, CPU and RAM) for both the site and Googlebot, but otherwise it doesn't affect the indexing or ranking of your site.
How does web crawling work? Search engines use their own web crawlers to discover and access web pages. All commercial search engine crawlers begin crawling a website by downloading its robots.txt file, which contains rules about which pages search engines should or should not crawl on that website.
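As a sketch of that first step, Python's standard `urllib.robotparser` applies the same rules a well-behaved crawler would. The robots.txt content below is hypothetical, but the check it performs mirrors what the bot does before fetching each URL:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a forum: block crawling of the
# admin area, allow everything else.
rules = """
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler checks each URL against the rules before fetching it.
print(parser.can_fetch("*", "http://example.com/showthread.aspx?threadid=7"))  # True
print(parser.can_fetch("*", "http://example.com/admin/delete.php"))            # False
```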
After a crawler finds a page, the search engine renders it just like a browser would. In the process of doing so, the search engine analyzes that page's contents. All of that information is stored in its index.
Bad bots can steal your private data or take down an otherwise healthy website, so we want to block any bad bots we can uncover. It's not easy to discover every bot that may crawl your site, but with a little digging you can find the malicious ones that you don't want visiting your site anymore.
Complicated stuff.
From my experience, it depends mostly on the URL scheme you use to link pages together; that determines which pages the crawler will visit.
Most engines will crawl the entire website if it is properly hyperlinked with crawl-friendly URLs (e.g. URL rewriting instead of topicID=123 query strings) and all pages are reachable within a few clicks from the main page.
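As an illustration of a crawl-friendly URL (assuming an Apache front end; IIS has equivalent URL-rewriting modules, and the path pattern here is hypothetical), a rewrite rule can expose a clean URL while internally serving the real query-string page:

```apache
# Hypothetical Apache mod_rewrite rule: expose /thread/123 to crawlers
# while internally serving showthread.aspx?threadid=123
RewriteEngine On
RewriteRule ^thread/([0-9]+)$ /showthread.aspx?threadid=$1 [L]
```

The crawler then sees stable, simple paths instead of query strings, which some older crawlers treat with suspicion.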
Another case is paging: if you have paging, sometimes the bot crawls just the first page and stops when it finds that the next-page link keeps hitting the same document, e.g. one index.php for the entire website.
You wouldn't want a bot to accidentally hit a page that performs an action, e.g. a "Delete topic" link that points to "delete.php?topicID=123", so most crawlers check for those cases as well.
The Tools page at SEOmoz also provides a lot of information and insight about the way some crawlers work and what information they will extract and chew on, etc. You could use those tools to determine whether pages deep inside your forum, e.g. a year-old post, might get crawled or not.
And some crawlers let you customize their crawling behavior... something like Google Sitemaps. You can tell them which pages to crawl and which not to, and in which order, etc. I remember similar services are available from MSN and Yahoo as well, but I've never tried them myself.
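For instance, a minimal XML sitemap (the URL and the frequency/priority values here are placeholders) can list older threads explicitly and hint how rarely they change, so the bot doesn't need to re-crawl them aggressively:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/showthread.aspx?threadid=7</loc>
    <changefreq>yearly</changefreq>
    <priority>0.2</priority>
  </url>
</urlset>
```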
You can also throttle well-behaved crawling bots so they don't overwhelm your website by providing a robots.txt file in the website root.
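A sketch of such a robots.txt, assuming the engine honors the non-standard Crawl-delay directive (Yahoo and MSN have supported it; Google ignores it and instead lets you set a crawl rate in its webmaster tools):

```
User-agent: *
# Wait ~10 seconds between requests (non-standard directive)
Crawl-delay: 10
# Keep bots out of action pages entirely
Disallow: /admin/
```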
Basically, if you design your forum so that the URLs don't look hostile to crawlers, they'll merrily crawl the entire website.