Did Facebook just implement some web crawler? My website has been crashing a couple times over the past few days, severely overloaded by IPs that I've traced back to Facebook.
I have tried googling around but can't find any definitive resource on controlling Facebook's crawler bot via robots.txt. There is a reference that suggests adding the following:
    User-agent: facebookexternalhit/1.1
    Crawl-delay: 5

    User-agent: facebookexternalhit/1.0
    Crawl-delay: 5

    User-agent: facebookexternalhit/*
    Crawl-delay: 5
But I can't find any specific reference on whether Facebook's bot respects robots.txt. According to older sources, Facebook "does not crawl your site". But this is definitely false, as my server logs show it crawling my site from a dozen or more IPs in the ranges 69.171.237.0/24 and 69.171.229.115/24, at a rate of many pages per second.
And I can't find any literature on this. I suspect it is something new that Facebook implemented over the past few days, since my server never crashed like this before.
Can someone please advise?
As discussed in this similar question on Facebook and Crawl-delay, Facebook does not consider itself a bot and doesn't even request your robots.txt, much less pay attention to its contents.
You can implement your own rate-limiting code, as shown in the similar question link. The idea is simply to return HTTP status code 503 when your server is over capacity or is being inundated by a particular user agent.
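To make the 503 idea concrete, here is a minimal sketch in Python. It is not the code from the linked question; the class name `UserAgentThrottle`, the `status_for` method, and the default limits are all made up for illustration, and the state lives in a single process, so it would not be shared across multiple workers or servers.

```python
import time
from collections import deque


class UserAgentThrottle:
    """Sliding-window limiter for requests whose User-Agent contains a marker.

    Hypothetical helper for illustration; names and limits are assumptions.
    """

    def __init__(self, marker="facebookexternalhit", max_requests=5,
                 window_seconds=1.0):
        self.marker = marker.lower()
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of recently accepted crawler requests

    def status_for(self, user_agent, now=None):
        """Return 200 to serve the request, or 503 to ask the client to back off."""
        if self.marker not in (user_agent or "").lower():
            return 200  # other visitors are never throttled by this limiter
        now = time.monotonic() if now is None else now
        # Forget accepted requests that have fallen out of the window.
        while self._hits and now - self._hits[0] >= self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.max_requests:
            return 503  # over the limit: reject without recording the request
        self._hits.append(now)
        return 200
```

In a real application you would call something like `status_for(request_user_agent)` at the top of your request handler and, on 503, send an empty response, ideally with a `Retry-After` header. Whether Facebook's crawler actually slows down in response is something you would have to check in your own logs.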
It appears those working for huge tech companies don't understand "improve your caching" is something small companies don't have budgets to handle. We are focused on serving our customers that actually pay money, and don't have time to fend off rampaging web bots from "friendly" companies.
We saw the same behaviour at about the same time (mid October) - floods of requests from Facebook that caused queued requests and slowness across the system. To begin with it was every 90 minutes; over a few days this increased in frequency and became randomly distributed.
The requests appeared not to respect robots.txt, so we were forced to think of a different solution. In the end we set up nginx to forward all requests with a Facebook user agent to a dedicated pair of backend servers. If we were using nginx > v0.9.6 we could have done a nice regex for this, but we weren't, so we used a mapping along the lines of:
    # Exact match on Facebook's crawler user agent; the value is the backend address for that traffic.
    map $http_user_agent $fb_backend_http {
        "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)"
            127.0.0.1:80;
    }
This has worked nicely for us; during the couple of weeks that we were getting hammered this partitioning of requests kept the heavy traffic away from the rest of the system.
It seems to have largely died down for us now - we're just seeing intermittent spikes.
As to why this happened, I'm still not sure. There seems to have been a similar incident in April that was attributed to a bug (http://developers.facebook.com/bugs/409818929057013/), but I'm not aware of anything similar more recently.