Googlebot Unexplained 32-character hexadecimal appended string causing more than 20,000 404 errors per day

I have a very interesting problem that I am failing to explain.

Every 2 to 6 seconds, Googlebot (I have looked up Googlebot's IP, and it's the real thing [using host IP]) requests a page on our site (running PHP, Apache, and MongoDB) that does not exist, so the request returns a 404. No other robot or human has ever requested a page like this. Just Googlebot.

The requests each look something like this:

/2de4f853c2853807b2e72387aa8928a4

/ea5700c343d1a9798bc554af7c1a330e

/e5aafa102d54ba7517703336846cc019
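To count how many requests follow this pattern, the paths can be matched with a regular expression. This is only a minimal sketch: the helper name `is_hex32_path` and the sample list are made up for illustration, and it assumes the path is exactly a slash followed by 32 lowercase hex characters, as in the three examples above.

```python
import re

# A path that is exactly "/" followed by 32 lowercase hexadecimal characters.
HEX32_PATH = re.compile(r"/[0-9a-f]{32}")


def is_hex32_path(path: str) -> bool:
    """Return True if the whole path matches the 32-character hex pattern."""
    return HEX32_PATH.fullmatch(path) is not None


# Sample paths: the three from the question plus one ordinary page.
paths = [
    "/2de4f853c2853807b2e72387aa8928a4",
    "/ea5700c343d1a9798bc554af7c1a330e",
    "/e5aafa102d54ba7517703336846cc019",
    "/about",
]
suspicious = [p for p in paths if is_hex32_path(p)]
```

Running the same check over the access log's request paths would show whether any client other than this one ever requests such URLs.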

Our code does not use any 32-character strings, and there are no links anything like these, either internal or external to our site. We use CodeIgniter, so at first I thought it was the default session_id. I have checked, and it is not.

Has anyone ever seen anything like this? Our website uses history.push on some pages, could this cause it? Just an idea.

Raw Data of an example request:

array (
  'date' => '2012-12-01',
  'time' => '10:01:33 PM',
  'additional_data' => 
    array (
      'server_vars' => 
        array (
          'REDIRECT_STATUS' => '200',
          'HTTP_HOST' => 'www.xxxxxxx.com',
          'HTTP_ACCEPT' => '*/*',
          'HTTP_ACCEPT_ENCODING' => 'gzip,deflate',
          'HTTP_FROM' => 'googlebot(at)googlebot.com',
          'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
          'HTTP_X_FORWARDED_FOR' => 'xxxxxxx',
          'HTTP_X_FORWARDED_PORT' => '80',
          'HTTP_X_FORWARDED_PROTO' => 'http',
          'HTTP_CONNECTION' => 'keep-alive',
          'PATH' => '/sbin:/usr/sbin:/bin:/usr/bin:/home/ec2-user/ec2/bin',
          'SERVER_SIGNATURE' => '<address>Apache/2.2.22 (Amazon) Server at www.xxxxxxx.com Port 80</address>
',
          'SERVER_SOFTWARE' => 'Apache/2.2.22 (Amazon)',
          'SERVER_NAME' => 'www.xxxxxxx.com',
          'SERVER_ADDR' => 'xxxxxxxxxx',
          'SERVER_PORT' => '80',
          'REMOTE_ADDR' => '10.171.147.114',
          'REMOTE_PORT' => '40759',
          'REDIRECT_URL' => '/e5aafa102d54ba7517703336846cc019',
          'GATEWAY_INTERFACE' => 'CGI/1.1',
          'SERVER_PROTOCOL' => 'HTTP/1.1',
          'REQUEST_METHOD' => 'GET',
          'QUERY_STRING' => '',
          'REQUEST_URI' => '/e5aafa102d54ba7517703336846cc019',
          'SCRIPT_NAME' => '/index.php',
          'PATH_INFO' => '/e5aafa102d54ba7517703336846cc019',
          'PATH_TRANSLATED' => 'redirect:/index.php/e5aafa102d54ba7517703336846cc019',
          'PHP_SELF' => '/index.php/e5aafa102d54ba7517703336846cc019',
          'REQUEST_TIME' => 1354428093,
       ),
    'codeigiter_session' => 
      array (
        'session_id' => 'c795e40a279f58d9fbbf7f5501a26787',
        'ip_address' => '10.171.147.114',
        'user_agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'last_activity' => 1354428093,
        'user_data' => '',
    ),
  ),
)

What else can I collect to figure this out? It's very strange.


Update: The traffic is coming from two primary IP addresses: 10.171.147.114 and 10.161.46.102.

I have looked these up and they are not Googlebot.

I got this info from an IP lookup site:

Remember that IP address ranges 10.0.0.0 – 10.255.255.255, 172.16.0.0 – 172.31.255.255, 192.168.0.0 – 192.168.255.255 and 224.0.0.0 - 239.255.255.255 are reserved IP Addresses for private internet use and IP lookup for these will not return any results.
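A quick way to check which category an address falls into, without a lookup site, is Python's standard `ipaddress` module. This is a rough sketch: the helper names `describe_ip` and `first_forwarded_ip` are invented here, and the assumption that the first entry of `X-Forwarded-For` is the original client is the common convention, not something confirmed for this setup. Note that `ipaddress` reports 224.0.0.0 - 239.255.255.255 as multicast rather than private.

```python
import ipaddress


def describe_ip(addr: str) -> str:
    """Classify an address as multicast, private, or public."""
    ip = ipaddress.ip_address(addr)
    if ip.is_multicast:
        return "multicast"
    if ip.is_private:
        return "private"
    return "public"


def first_forwarded_ip(header: str) -> str:
    """Return the first entry of an X-Forwarded-For value.

    Assumption: entries are comma-separated and the first one is the
    original client, with any proxies appended after it.
    """
    return header.split(",")[0].strip()


# The two addresses from the update are both in the private 10.0.0.0/8 range.
kinds = {ip: describe_ip(ip) for ip in ("10.171.147.114", "10.161.46.102")}
```

Since both REMOTE_ADDR values are private, they belong to something inside the hosting network (a proxy or load balancer, for example), not to a machine on the public internet.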

What should I, or can I, do about these requests? What is the point of them? If this is a type of DoS attack, they are doing a very bad job of it.

RonSper asked Dec 02 '12 06:12


1 Answer

To answer this question: the problem was caused by the AWS load balancer's health checks. For some reason, AWS is using the Googlebot user agent to perform them on our servers.
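For anyone who hits something similar: the user agent alone proves nothing, so it is worth confirming a claimed Googlebot with a reverse DNS lookup followed by a forward lookup. Below is a minimal sketch of that check. The function name `is_verified_googlebot`, the injectable `reverse_lookup` / `forward_lookup` parameters, and the two hostname suffixes are my own assumptions for illustration (the suffixes follow Google's published guidance as I understand it), so adjust them to your setup.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for an address claiming to be Googlebot.

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of IPs
    default to the system resolver; they are parameters so the check can be
    exercised without network access.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        # Step 1: the PTR record must point into a Google-owned domain.
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: that hostname must resolve back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
```

Run the check against the client address from X-Forwarded-For if you are behind a load balancer, not against REMOTE_ADDR. A private address such as 10.171.147.114 will not reverse-resolve to a Google hostname, so it fails this check no matter what user agent it sends.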

RonSper answered Oct 21 '22 04:10
