I'm going to block all bots except the big search engines. One of my blocking methods will be to check for "language": Accept-Language: If it has no Accept-Language the bot's IP address will be blocked until 2037. Googlebot does not have Accept-Language, I want to verify it with DNS lookup
<?php
gethostbyaddr($_SERVER['REMOTE_ADDR']);
?>
Is it ok to use gethostbyaddr
, can someone pass my "gethostbyaddr protection"?
In the left-hand navigation, click Crawl and then select Fetch as Google. In the textbox, enter the path component of a URL on your site that you want Googlebot to retrieve. From the grey drop-down, choose the type of Googlebot with which you wish to perform a fetch (or fetch and render).
However, starting November 2020, Googlebot may crawl sites that may benefit from it over HTTP/2 if it's supported by the site. This may save computing resources (for example, CPU, RAM) for the site and Googlebot, but otherwise it doesn't affect indexing or ranking of your site.
Common Web Crawler Detection Methods Commonly used methods such as proper configuration in robots. txt files on server, whitelisting user-agent, among others, can detect and block some low level malicious crawlers.
"Crawler" (sometimes also called a "robot" or "spider") is a generic term for any program that is used to automatically discover and scan websites by following links from one webpage to another. Google's main crawler is called Googlebot.
function detectSearchBot($ip, $agent, &$hostname)
{
$hostname = $ip;
// check HTTP_USER_AGENT what not to touch gethostbyaddr in vain
if (preg_match('/(?:google|yandex)bot/iu', $agent)) {
// success - return host, fail - return ip or false
$hostname = gethostbyaddr($ip);
// https://support.google.com/webmasters/answer/80553
if ($hostname !== false && $hostname != $ip) {
// detect google and yandex search bots
if (preg_match('/\.((?:google(?:bot)?|yandex)\.(?:com|ru))$/iu', $hostname)) {
// success - return ip, fail - return hostname
$ip = gethostbyname($hostname);
if ($ip != $hostname) {
return true;
}
}
}
}
return false;
}
In my project, I use this function to identify Google and Yandex search bots.
The result of the detectSearchBot function is caching.
The algorithm is based on Google’s recommendation - https://support.google.com/webmasters/answer/80553
In addition to Cristian's answer:
function is_valid_google_ip($ip) {
$hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"
return preg_match('/\.googlebot|google\.com$/i', $hostname);
}
function is_valid_google_request($ip=null,$agent=null){
if(is_null($ip)){
$ip=$_SERVER['REMOTE_ADDR'];
}
if(is_null($agent)){
$agent=$_SERVER['HTTP_USER_AGENT'];
}
$is_valid_request=false;
if (strpos($agent, 'Google')!==false && is_valid_google_ip($ip)){
$is_valid_request=true;
}
return $is_valid_request;
}
Note
Sometimes when using $_SERVER['HTTP_X_FORWARDED_FOR']
OR $_SERVER['REMOTE_ADDR']
more than 1 IP address is returned, for example '155.240.132.261, 196.250.25.120'. When this string is passed as an argument for gethostbyaddr()
PHP gives the following error:
Warning: Address is not a valid IPv4 or IPv6 address in...
To work around this I use the following code to extract the first IP address from the string and discard the rest. (If you wish to use the other IPs they will be in the other elements of the $ips array).
if (strstr($remoteIP, ', ')) {
$ips = explode(', ', $remoteIP);
$remoteIP = $ips[0];
}
https://www.php.net/manual/en/function.gethostbyaddr.php
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With