Verifying a Googlebot

I'm going to block all bots except the big search engines. One of my blocking methods will be to check the Accept-Language header: if a request has no Accept-Language header, the bot's IP address will be blocked until 2037. Googlebot does not send Accept-Language, so I want to verify it with a DNS lookup:

<?php
gethostbyaddr($_SERVER['REMOTE_ADDR']);
?>

Is it OK to use gethostbyaddr? Can someone get past my "gethostbyaddr protection"?

ilhan asked Jun 20 '10 01:06


2 Answers

function detectSearchBot($ip, $agent, &$hostname)
{
    $hostname = $ip;

    // Check the User-Agent first so that gethostbyaddr() is not called needlessly
    if (preg_match('/(?:google|yandex)bot/iu', $agent)) {
        // gethostbyaddr(): host name on success, the unmodified IP on failure, false on malformed input
        $hostname = gethostbyaddr($ip);

        // https://support.google.com/webmasters/answer/80553
        if ($hostname !== false && $hostname != $ip) {
            // Accept only host names under the Google and Yandex domains
            if (preg_match('/\.((?:google(?:bot)?|yandex)\.(?:com|ru))$/iu', $hostname)) {
                // gethostbynamel(): list of IPv4 addresses on success, false on failure
                $forward = gethostbynamel($hostname);

                // Forward-confirm: the host name must resolve back to the original IP
                if ($forward !== false && in_array($ip, $forward, true)) {
                    return true;
                }
            }
        }
    }

    return false;
}

In my project, I use this function to identify Google and Yandex search bots.

The result of the detectSearchBot function is cached.

The algorithm is based on Google’s recommendation - https://support.google.com/webmasters/answer/80553
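
The caching mentioned above is not shown in the answer. One way it could be done is sketched below, assuming a plain array as the cache store. The helper name cachedBotCheck, its parameters, and the one-day default TTL are all made up for this example. The verifier is passed in as a callable, so the DNS lookups only run on a cache miss.

```php
<?php
/**
 * Hypothetical helper (not part of the answer): remember a bot-verification
 * verdict per IP for $ttl seconds.
 *
 * @param array    $cache  cache store, [ip => ['ok' => bool, 'at' => int]]
 * @param string   $ip     client IP address
 * @param callable $verify function (string $ip): bool, for example a wrapper
 *                         around detectSearchBot()
 * @param int      $now    current Unix timestamp, e.g. time()
 * @param int      $ttl    lifetime of a verdict in seconds (assumed: one day)
 */
function cachedBotCheck(array &$cache, string $ip, callable $verify, int $now, int $ttl = 86400): bool
{
    // Reuse a stored verdict while it is still fresh
    if (isset($cache[$ip]) && ($now - $cache[$ip]['at']) < $ttl) {
        return $cache[$ip]['ok'];
    }

    // Cache miss or stale entry: run the slow DNS-based check once
    $ok = (bool) $verify($ip);
    $cache[$ip] = ['ok' => $ok, 'at' => $now];

    return $ok;
}
```

In a real application the array would probably be replaced by a store shared between requests (a database table, a file, or a memory cache), and $verify would wrap detectSearchBot together with the request's User-Agent.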

Worka answered Nov 14 '22 22:11


In addition to Cristian's answer:

function is_valid_google_ip($ip) {
    
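    $hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"

    // Both alternatives are anchored: the host must end in .googlebot.com or .google.com
    if (!is_string($hostname) || !preg_match('/\.(?:googlebot|google)\.com$/i', $hostname)) {
        return false;
    }

    // Forward-confirm: the host name must resolve back to the same IP
    $ips = gethostbynamel($hostname);

    return $ips !== false && in_array($ip, $ips, true);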
}

function is_valid_google_request($ip=null,$agent=null){
    
    if(is_null($ip)){
        
        $ip=$_SERVER['REMOTE_ADDR'];
    }
    
    if(is_null($agent)){
        
        // The header may be missing, so fall back to an empty string
        $agent=$_SERVER['HTTP_USER_AGENT'] ?? '';
    }
    
    $is_valid_request=false;

    if (strpos($agent, 'Google')!==false && is_valid_google_ip($ip)){
        
        $is_valid_request=true;
    }
    
    return $is_valid_request;
}
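
The host-name test is the part of this check that is easiest to get subtly wrong. In a pattern such as /\.googlebot|google\.com$/, for example, the $ anchor applies only to the second alternative, so ".googlebot" is accepted anywhere in the name. As an illustration only, here is a sketch of a suffix check that avoids regular expressions altogether. The function name hostEndsWithDomain and the domain list are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper: true when $hostname is a subdomain of one of
 * the given $domains (case-insensitive, matched at the end of the name).
 */
function hostEndsWithDomain(string $hostname, array $domains): bool
{
    // Normalise: lower-case and drop a trailing root dot, if any
    $hostname = rtrim(strtolower($hostname), '.');

    foreach ($domains as $domain) {
        $suffix = '.' . strtolower($domain);
        // Compare the tail of the host name with ".domain"
        if (strlen($hostname) > strlen($suffix)
            && substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }

    return false;
}

// Assumed domain list for Googlebot reverse-DNS names
$googleDomains = ['googlebot.com', 'google.com'];

var_dump(hostEndsWithDomain('crawl-66-249-66-1.googlebot.com', $googleDomains)); // bool(true)
var_dump(hostEndsWithDomain('crawl.googlebot.com.example.net', $googleDomains)); // bool(false)
var_dump(hostEndsWithDomain('notgoogle.com', $googleDomains));                   // bool(false)
```

A suffix check like this only covers the name; the forward lookup back to the original IP is still needed, because anyone who controls the reverse DNS for their own address range can set an arbitrary host name.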

Note

Sometimes when using $_SERVER['HTTP_X_FORWARDED_FOR'] or $_SERVER['REMOTE_ADDR'], more than one IP address is returned, for example '155.240.132.261, 196.250.25.120'. When this string is passed as an argument to gethostbyaddr(), PHP gives the following error:

Warning: Address is not a valid IPv4 or IPv6 address in...

To work around this I use the following code to extract the first IP address from the string and discard the rest. (If you wish to use the other IPs they will be in the other elements of the $ips array).

// Split on commas and trim, so both 'a, b' and 'a,b' are handled
$ips = array_map('trim', explode(',', $remoteIP));
$remoteIP = $ips[0];
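
A variation on the workaround above is to validate each candidate with filter_var() before calling gethostbyaddr(), so that a malformed entry never reaches the DNS lookup. The helper name firstValidIp is made up for this sketch. Note that X-Forwarded-For is supplied by the client unless a trusted proxy sets it, so it is only suitable for bot verification behind such a proxy.

```php
<?php
/**
 * Hypothetical helper: return the first syntactically valid IPv4/IPv6
 * address from a comma-separated header value, or null if there is none.
 */
function firstValidIp(string $headerValue): ?string
{
    foreach (explode(',', $headerValue) as $candidate) {
        $candidate = trim($candidate);
        // filter_var() returns the address on success and false on failure
        if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
            return $candidate;
        }
    }

    return null;
}

// The first entry of the example string has an octet above 255, so it is skipped
var_dump(firstValidIp('155.240.132.261, 196.250.25.120')); // string(14) "196.250.25.120"
var_dump(firstValidIp('not an ip'));                        // NULL
```

Returning null instead of a guess lets the caller decide what to do when no usable address is present, for example treating the request as unverified.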

https://www.php.net/manual/en/function.gethostbyaddr.php

RafaSashi answered Nov 14 '22 22:11