How can one detect a crawler / spider using PHP?
I'm currently working on a project where I need to keep track of each crawler's visit.
I know that you should use HTTP_USER_AGENT but I'm not really sure how to format the code for this purpose and i know that the USER AGENT can be changed very easy so i would also like to know if it is possible to add some more parameters to avoid spoofing?
Sample code of what i'm trying to do..
<?php
$user_agent = $_SERVER['HTTP_USER_AGENT'];
if (strpos( $user_agent, 'Google') !== false)
{
echo "Googlebot is here";
}
?>
Thank you
There are two methods of verifying the IP: Some search engines provide IP lists or ranges. You can verify the crawler by matching its IP with the provided list. You can perform a DNS look up to connect the IP address to the domain name.
PHP in Telugu A search engine directory of the spider names can be used as a reference. Next, $_SERVER['HTTP_USER_AGENT']; can be used to check if the agent is a spider (bot).
Alternatively, you can identify Googlebot by IP address by matching the crawler's IP address to the list of Googlebot IP addresses. For other Google IP addresses from where your site may be accessed (for example, by user request or Apps Scripts), match the accessing IP address against the list of Google IP addresses.
Verifying Googlebot the only official supported way to identify a google bot is to run a reverse DNS lookup on the accessing IP address and run a forward DNS lookup on the result to verify that it points to accessing IP address and the resulting domain name is in either googlebot.com or google.com domain.
According to Verifying Googlebot:
You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer
crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
You can do a reverse DNS lookup:
function validateGoogleBotIP($ip) {
$hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"
return preg_match('/\.google(bot)?\.com$/i', $hostname);
}
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Google') !== false) {
if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
echo 'It is ACTUALLY google';
} else {
echo 'Someone\'s faking it!';
}
} else {
echo 'Nothing to do with Google';
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With