
How to detect search engine visits on my site, like phpBB?

Is there any way to detect search engines or crawlers on my site? In the phpBB admin panel, I have seen that we can view and allow search engines, and also see the last visit of each bot (like Googlebot).

Is there a script for this in PHP? I don't mean Google Analytics or a similar application. I need to implement this for my blog site, and I think there must be some way to find out.

coderex asked Jul 20 '09 16:07


2 Answers

You can go by either IP addresses or the 'User-Agent' string that the bot or web browser sends you.

When Googlebot (or most other well-behaved robots) visits your website, it sends a User-Agent header identifying itself, which PHP exposes as $_SERVER['HTTP_USER_AGENT']. Some examples are:

Googlebot/2.1 (+http://www.google.com/bot.html)

NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html

Baiduspider+(+http://www.baidu.com/search/spider_jp.html)

Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/531.4 (KHTML, like Gecko)

You can find many more examples in online lists of user-agent strings.

You could then use PHP to examine those user-agent strings and determine if the user is a search engine or not. I use something like this often:

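// Substrings (matched case-insensitively) that identify common crawlers.
$searchengines = array(
    'Googlebot',
    'Slurp',
    'search.msn.com',
    'nutch',
    'simpy',
    'bot',
    'ASPSeek',
    'crawler',
    'msnbot',
    'Libwww-perl',
    'FAST',
    'Baidu',
);

// The header may be missing entirely, so fall back to an empty string.
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$is_se = false;
foreach ($searchengines as $searchengine) {
    // stripos() searches case-insensitively, so neither string needs lowercasing.
    if ($user_agent !== '' && stripos($user_agent, $searchengine) !== false) {
        $is_se = true;
        break;
    }
}
if ($is_se) {
    print("It's a search engine!");
}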

Remember that no detection method (Google Analytics, another statistics package, or otherwise) is going to be 100% accurate. Some web browsers let you set a custom user-agent string, and some misbehaving web crawlers may not send a user-agent string at all. This method is probably effective for 95%+ of crawlers/visitors, though.
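Since the question also asks about showing each bot's last visit, here is a minimal sketch of how the same substring check could return which signature matched, so that a timestamp can be stored per bot. The function names detect_crawler() and record_crawler_visit() and the shortened signature list are made up for illustration, and persisting the array (to a file or database table) is left out.

```php
<?php
/**
 * Return the first signature found in the user-agent string, or null.
 * Put specific names before generic ones such as 'bot', because the
 * first match wins. (Hypothetical helper, not part of any library.)
 */
function detect_crawler($userAgent, array $signatures)
{
    if (!is_string($userAgent) || $userAgent === '') {
        return null;
    }
    foreach ($signatures as $signature) {
        // Case-insensitive substring search.
        if (stripos($userAgent, $signature) !== false) {
            return $signature;
        }
    }
    return null;
}

/**
 * Return a copy of $lastVisits with the visit time updated for one bot.
 * (Hypothetical helper; real code would persist this somewhere.)
 */
function record_crawler_visit(array $lastVisits, $signature, $timestamp)
{
    $lastVisits[$signature] = $timestamp;
    return $lastVisits;
}

// Example wiring for a normal web request.
$signatures = array('Googlebot', 'Slurp', 'msnbot', 'Baidu', 'bot', 'crawler');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$match = detect_crawler($ua, $signatures);

$lastVisits = array();
if ($match !== null) {
    $lastVisits = record_crawler_visit($lastVisits, $match, time());
}
```

Returning the matched name instead of a boolean costs nothing extra and gives you a key to group visits by, which is roughly what a "last visit per bot" table needs.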

Keith Palmer Jr. answered Oct 11 '22 11:10


  1. You can try to detect them using their user-agent string. A list of them can be found here: http://www.botsvsbrowsers.com/

    Search engines tend to include the words "crawler" and "robot" in their user-agent strings.

  2. Search engine crawlers are almost the only clients that request robots.txt.

  3. Some IP addresses are known to belong to bots such as Googlebot.
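For point 3, one common way to check an IP without maintaining a fixed list is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it yields the same IP. The sketch below is only an illustration. The function name is_verified_googlebot() is made up, the accepted domain suffixes (googlebot.com and google.com) are an assumption to check against Google's current documentation, and the two optional callable parameters exist only so the logic can be tried without network access.

```php
<?php
/**
 * Forward-confirmed reverse DNS check for an IP claiming to be Googlebot.
 * $reverse and $forward default to PHP's built-in resolvers; they are
 * parameters only so the logic can be exercised with fake lookups.
 * Note: gethostbynamel() returns IPv4 addresses only, so this sketch
 * does not confirm IPv6 clients.
 */
function is_verified_googlebot($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel')
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged
    // when there is no PTR record, and false on malformed input.
    $host = call_user_func($reverse, $ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Step 2: the hostname must end in an accepted domain (assumed list).
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: forward lookup must include the original IP.
    $ips = call_user_func($forward, $host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

DNS lookups are slow compared with a string comparison, so in practice you would probably run this only for requests whose user-agent already claims to be Googlebot, and cache the result per IP.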

Georg Schölly answered Oct 11 '22 13:10