Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.

In some cases, it is accessing the same og:image resource multiple times over the space of 1-5 minutes. In one example - the crawler accessed the same image 12 times over the course of 3 minutes using 12 different IP addresses.

I only had to log requests for 10 minutes before I caught the following example:

List of times and crawler IP addresses for one image:

2018-03-30 15:12:58 - 66.220.156.145
2018-03-30 15:13:13 - 66.220.152.7
2018-03-30 15:12:59 - 66.220.152.100
2018-03-30 15:12:18 - 66.220.155.248
2018-03-30 15:12:59 - 173.252.124.29
2018-03-30 15:12:15 - 173.252.114.118
2018-03-30 15:12:42 - 173.252.85.205
2018-03-30 15:13:01 - 173.252.84.117
2018-03-30 15:12:40 - 66.220.148.100
2018-03-30 15:13:10 - 66.220.148.169
2018-03-30 15:15:16 - 173.252.99.50
2018-03-30 15:14:50 - 69.171.225.134

What the og:image is according to Facebook's documentation:

The URL of the image that appears when someone shares the content to Facebook. See below for more info, and check out our best practices guide to learn how to specify a high quality preview image.

The images that I use in the og:image have an Expires header set to +7 days in the future. Recently, I changed that to +1 year in the future. Neither setting seems to make any difference. These are the headers that the crawler seems to be ignoring:

Cache-Control: max-age=604800
Content-Length: 31048
Content-Type: image/jpeg
Date: Fri, 30 Mar 2018 15:56:47 GMT
Expires: Sat, 30 Mar 2019 15:56:47 GMT
Pragma: public
Server: nginx/1.4.6 (Ubuntu)
Transfer-Encoding: chunked
X-Powered-By: PHP/5.5.9-1ubuntu4.23
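
For reference, headers like these would typically be emitted from the PHP image script roughly as follows (a minimal sketch with an illustrative one-year lifetime and file lookup, not the actual production code):

<?php

// Minimal sketch: emit long-lived caching headers before streaming the image.
// The one-year lifetime and the cache path are illustrative values.
$maxAge = 365 * 24 * 60 * 60; // one year in seconds

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
header('Pragma: public');

// Hypothetical lookup by id; the real script resolves the image differently.
readfile(__DIR__ . '/cache/' . basename($_GET['id'] ?? '') . '.jpg');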

According to Facebook's Object Properties documentation, the og:ttl property is:

Seconds until this page should be re-scraped. Use this to rate limit the Facebook content crawlers. The minimum allowed value is 345600 seconds (4 days); if you set a lower value, the minimum will be used. If you do not include this tag, the ttl will be computed from the "Expires" header returned by your web server, otherwise it will default to 7 days.

I have set this og:ttl property to 2419200, which is 28 days in the future.

I have been tempted to use something like this:

header("HTTP/1.1 304 Not Modified"); 
exit;

But my fear would be that Facebook's Crawler would ignore the header and mark the image as broken - thereby removing the image preview from the shared story.

A video showing the rate at which these requests from the Crawler are coming in.

Is there a way to prevent the crawler from coming back to hit these resources so soon?

Example code showing what my open graph and meta properties look like:

<meta property="fb:app_id" content="MyAppId" />
<meta property="og:locale" content="en_GB" />
<meta property="og:type" content="website" />
<meta property="og:title" content="My title" />
<meta property="og:description" content="My description" />
<meta property="og:url" content="http://example.com/index.php?id=1234" />
<link rel="canonical" href="http://example.com/index.php?id=1234" />
<meta property="og:site_name" content="My Site Name" />
<meta property="og:image" content="http://fb.example.com/img/image.php?id=123790824792439jikfio09248384790283940829044" />
<meta property="og:image:width" content="940"/>
<meta property="og:image:height" content="491"/>
<meta property="og:ttl" content="2419200" />
asked Mar 30 '18 by Wayne Whitty


3 Answers

After trying almost everything else with caching, headers and what not, the only thing that saved our servers from the "overly enthusiastic" Facebook crawler (user agent facebookexternalhit) was simply denying access and sending back an HTTP/1.1 429 Too Many Requests response whenever the crawler "crawled too much".

Admittedly, we had thousands of images we wanted the crawler to crawl, but the Facebook crawler was practically DDoSing our server with tens of thousands of requests per hour (yes, for the same URLs over and over). At one point I remember it was 40,000 requests per hour from different Facebook IP addresses using the facebookexternalhit user agent.

We did not want to block the crawler entirely, and blocking by IP address was also not an option. We only needed the FB crawler to back off (quite) a bit.

This is a piece of PHP code we used to do it:

.../images/index.php

<?php

// Number of requests permitted for facebook crawler per second.
const FACEBOOK_REQUEST_THROTTLE = 5;
const FACEBOOK_REQUESTS_JAR = __DIR__ . '/.fb_requests';
const FACEBOOK_REQUESTS_LOCK = __DIR__ . '/.fb_requests.lock';

function handle_lock($lockfile) {
    // Return the handle so the caller keeps it (and the lock) alive.
    $handle = fopen($lockfile, 'w');
    flock($handle, LOCK_EX);
    return $handle;
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? false;
if ($ua && strpos($ua, 'facebookexternalhit') !== false) {

    $lock = handle_lock(FACEBOOK_REQUESTS_LOCK);

    $jar = @file(FACEBOOK_REQUESTS_JAR);
    $currentTime = time();
    $timestamp = $jar[0] ?? time();
    $count = $jar[1] ?? 0;

    if ($timestamp == $currentTime) {
        $count++;
    } else {
        $count = 0;
    }

    file_put_contents(FACEBOOK_REQUESTS_JAR, "$currentTime\n$count");

    if ($count >= FACEBOOK_REQUEST_THROTTLE) {
        header("HTTP/1.1 429 Too Many Requests", true, 429);
        header("Retry-After: 60");
        die;
    }

}

// Everything under this comment happens only if the request is "legit". 

$filePath = $_SERVER['DOCUMENT_ROOT'] . $_SERVER['REQUEST_URI'];
if (is_readable($filePath)) {
    header("Content-Type: image/png");
    readfile($filePath);
}

You also need to configure rewriting to pass all requests directed at your images to this PHP script:

.../images/.htaccess (if you're using Apache)

RewriteEngine On
RewriteRule .* index.php [L] 

It seems the crawler "understood" this approach, and it effectively reduced the request rate from tens of thousands of requests per hour to hundreds or thousands of requests per hour.

answered by Smuuf


I received word back from the Facebook team themselves. Hopefully it clarifies how the crawler treats image URLs.

Here it goes:

The Crawler treats image URLs differently than other URLs.

We scrape images multiple times because we have different physical regions, each of which need to fetch the image. Since we have around 20 different regions, the developer should expect ~20 calls for each image. Once we make these requests, they stay in our cache for around a month - we need to rescrape these images frequently to prevent abuse on the platform (a malicious actor could get us to scrape a benign image and then replace it with an offensive one).

So basically, you should expect that the image specified in og:image will be hit 20 times after it has been shared. Then, a month later, it will be scraped again.

answered by Wayne Whitty


Blindly sending a 304 Not Modified header does not make much sense and can confuse Facebook's crawler even more. If you really decide to just block some requests, you may consider a 429 Too Many Requests response instead; it will at least clearly indicate what the problem is.

As a gentler solution, you may try the following:

  • Add a Last-Modified header with some static value. Facebook's crawler may be clever enough to detect that it should ignore the Expires header for constantly changing content, but not clever enough to handle a missing header properly.
  • Add an ETag header with proper 304 Not Modified support (a sketch follows after this list).
  • Change the Cache-Control header to max-age=315360000, public, immutable if the image is static.
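
A minimal sketch of the ETag / Last-Modified handling with 304 support, assuming the image bytes live in a file on disk (the cache path, the id parameter and the one-year lifetime are illustrative, not taken from the question):

<?php

// Minimal sketch of ETag / Last-Modified handling with 304 support.
// The file location is illustrative; point it at wherever the image lives.
$file = __DIR__ . '/cache/' . basename($_GET['id'] ?? '') . '.jpg';
if (!is_file($file)) {
    http_response_code(404);
    exit;
}

$mtime        = filemtime($file);
$etag         = '"' . md5($file . $mtime) . '"';
$lastModified = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';

header('ETag: ' . $etag);
header('Last-Modified: ' . $lastModified);
header('Cache-Control: public, max-age=315360000, immutable');

// Reply with 304 when the client sends back a matching validator.
if (($_SERVER['HTTP_IF_NONE_MATCH'] ?? null) === $etag
    || ($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null) === $lastModified) {
    http_response_code(304);
    exit;
}

header('Content-Type: image/jpeg');
header('Content-Length: ' . filesize($file));
readfile($file);

Whether Facebook's crawler actually sends If-None-Match or If-Modified-Since back is not guaranteed, so treat this as a best-effort optimization rather than a fix.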

You may also consider saving a cached copy of the image and serving it via the webserver without involving PHP. If you change the URLs to something like http://fb.example.com/img/image/123790824792439jikfio09248384790283940829044, you can create a fallback for nonexistent files with rewrite rules:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^img/image/([0-9a-z]+)$ img/image.php?id=$1 [L]

Only the first request should be handled by PHP, which will save a cached file for the requested URL (for example in /img/image/123790824792439jikfio09248384790283940829044). For all further requests the webserver should take care of serving the content from the cached file, sending proper headers and handling 304 Not Modified. You may also configure rate limiting in nginx; it should be more efficient than delegating image serving to PHP.
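
A minimal sketch of that write-through step inside img/image.php, assuming the script can produce the image bytes for a given id (load_image_bytes() is a hypothetical stand-in for however the real application generates or fetches the image):

<?php

// Minimal sketch: on the first request, store the generated image under the
// rewritten URL path, so the "!-f" rewrite condition stops matching and the
// webserver serves every later request directly from the static file.
$id        = basename($_GET['id'] ?? '');
$cachePath = $_SERVER['DOCUMENT_ROOT'] . '/img/image/' . $id;

if (!is_file($cachePath)) {
    $bytes = load_image_bytes($id); // hypothetical: generate or fetch the image
    file_put_contents($cachePath, $bytes, LOCK_EX);
}

header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=315360000, immutable');
readfile($cachePath);

After the first hit, the RewriteCond %{REQUEST_FILENAME} !-f condition no longer matches, so the webserver serves the cached file itself with its own headers and 304 handling.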

answered by rob006