
Block bingbot from crawling my site

I would like to completely block Bing from crawling my site for now; it is hitting my site at an alarming rate (500 GB of data a month).

I have 1000 subdomains added to Bing Webmaster Tools, so I can't go and set each one's crawl rate. I have tried blocking it using robots.txt, but it's not working. Here is my robots.txt:

# robots.txt 
User-agent: *
Disallow:
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member
Disallow: bingbot
User-agent: ia_archiver
Disallow: /
Zoinky asked Nov 28 '14

2 Answers

This WILL definitely affect your SEO/search ranking and will cause pages to drop from the index, so please use it with care.

You can block requests based on the user-agent string if you have the IIS URL Rewrite module installed (if not, install it first).

Then add a rule to your web.config like this:

<system.webServer>
  <rewrite>
    <rules>
      <!-- Return 403 for any request whose User-Agent contains msnbot or BingBot -->
      <rule name="Request Blocking Rule" stopProcessing="true">
        <match url=".*" />
        <conditions>
          <add input="{HTTP_USER_AGENT}" pattern="msnbot|BingBot" />
        </conditions>
        <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You do not have permission to view this page." />
      </rule>
    </rules>
  </rewrite>
</system.webServer>

This will return a 403 if the bot hits your site.
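If you want to verify the rule, one quick way is to send a request with a Bingbot-style User-Agent and check the status code. A minimal sketch in Python, where https://example.com/ is a placeholder for one of your own subdomains:

import urllib.error
import urllib.request

# Bingbot's published User-Agent string; anything containing "bingbot" or
# "msnbot" should trip the rewrite rule above.
req = urllib.request.Request(
    "https://example.com/",  # placeholder - use one of your own subdomains
    headers={"User-Agent": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print("Got", resp.status, "- the rule is not matching")
except urllib.error.HTTPError as err:
    print("Got", err.code)  # expect 403 once the rule is in place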

UPDATE

Looking at your robots.txt, I think it should be:

# robots.txt 
User-agent: *
Disallow:
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member
User-agent: bingbot
Disallow: /
User-agent: ia_archiver
Disallow: /
Carl answered Sep 23 '22

Your robots.txt is not correct:

  • You need line breaks between records (a record starts with one or more User-agent lines).

  • Disallow: bingbot disallows crawling of URLs whose paths start with "bingbot" (i.e., http://example.com/bingbot), which is probably not what you want.

  • Not an error, but Disallow: is not needed (as it’s the default anyway).

So you probably want to use:

User-agent: *
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member

User-agent: bingbot
User-agent: ia_archiver
Disallow: /

This disallows crawling of anything for "bingbot" and "ia_archiver". All other bots are allowed to crawl everything except URLs whose paths start with /member, /cgi-bin/, or *.axd.

Note that *.axd will be interpreted literally by bots following the original robots.txt specification (so they will not crawl http://example.com/*.axd, but they will crawl http://example.com/foo.axd). However, many bots extend the spec and interpret the * as some kind of wildcard.
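If you want to sanity-check how a spec-following parser reads this file, Python's urllib.robotparser behaves that way (simple prefix matching, paths taken literally). A rough sketch, with example.com as a placeholder host:

import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: *.axd
Disallow: /cgi-bin/
Disallow: /member

User-agent: bingbot
User-agent: ia_archiver
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("bingbot", "http://example.com/anything"))       # False - blocked entirely
print(rp.can_fetch("ia_archiver", "http://example.com/"))           # False - blocked entirely
print(rp.can_fetch("SomeOtherBot", "http://example.com/member/x"))  # False - /member prefix
print(rp.can_fetch("SomeOtherBot", "http://example.com/foo.axd"))   # True - literal "*.axd" does not match

The last line illustrates the point above: a parser that follows the original spec does not treat * as a wildcard, so foo.axd is still considered crawlable.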

unor answered Sep 22 '22