
Facebook and Crawl-delay in Robots.txt?

Do Facebook's web-crawling bots respect the Crawl-delay directive in robots.txt files?
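
For reference, the directive in question would be targeted at Facebook's scraper with something like the following in robots.txt (the 5-second value is only an illustration):

User-agent: facebookexternalhit
Crawl-delay: 5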

asked Oct 10 '11 by artlung


2 Answers

For a similar question, I offered a technical solution that simply rate-limits load based on the user-agent.

Code repeated here for convenience:

Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.

In PHP, execute the following code as early as possible for every request.

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Minimum number of seconds permitted between hits from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        flock( $fh, LOCK_EX ); // serialize concurrent crawler requests around the shared timestamp file
        $lastTime = (float) fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // compare the current microtime with the microtime of the last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail with HTTP 503 Service Unavailable if requests are coming too quickly
            header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
            die;
        } else {
            // record the microsecond time of this access
            rewind( $fh );
            ftruncate( $fh, 0 );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        // the throttle file could not be opened; refuse the crawler rather than risk an unthrottled burst
        header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
        die;
    }
}
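
Note that this approach coordinates requests through a single file in the system temp directory, so the throttle is per server; in a multi-server setup each machine would apply the limit independently, and a shared store would be needed for a site-wide limit.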
answered Oct 17 '22 by Stickley


No, it doesn't respect robots.txt

Contrary to other answers here, facebookexternalhit behaves like the meanest of crawlers. Whether it got the URLs it requests from crawling or from Like buttons doesn't matter much when it goes through every one of them at an insane rate.

We sometimes get several hundred hits per second as it goes through almost every URL on our site. It kills our servers every time. The funny thing is that when that happens, we can see that Googlebot slows down and waits for things to settle down before slowly ramping back up. facebookexternalhit, on the other hand, just continues to pound our servers, often harder than the initial bout that killed us.

We have to run much beefier servers than we actually need for our traffic, just because of facebookexternalhit. We've done tons of searching and can't find a way to slow them down.

How is that a good user experience, Facebook?

answered Oct 17 '22 by Branton Davis