 

How to hide a page url from bots/spiders?

Tags:

php

On my website I have 1000 products, and each has its own web page, accessible via something like product.php?id=PRODUCT_ID.

On all of these pages there is a link with a URL of the form action.php?id=PRODUCT_ID&referer=CURRENT_PAGE_URL. So if I am visiting product.php?id=100, this URL becomes action.php?id=100&referer=/product.php?id=100, and clicking it returns the user back to the referer.

Now, the problem I am facing is that I keep getting false hits from spiders. Is there any way I can avoid these false hits? I know I can "Disallow" this URL in robots.txt, but there are still bots that ignore it. What would you recommend? Any ideas are welcome. Thanks.
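For reference, the robots.txt rule the question refers to would look something like the lines below; well-behaved crawlers honour it, but as noted, not every bot does:

User-agent: *
Disallow: /action.php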

Asked by Kay on Mar 25 '11

2 Answers

Currently, the easiest way of making a link inaccessible to 99% of robots (even those that choose to ignore robots.txt) is with JavaScript. Add some unobtrusive jQuery:

<script type="text/javascript">
$(document).ready(function() {
    // Copy each link's data-href into its real href once the DOM is ready
    $('a[data-href]').each(function() {
        $(this).attr('href', $(this).attr('data-href'));
    });
});
</script>

Then construct your links in the following fashion:

<a href="" rel="nofollow" data-href="action.php?id=PRODUCT_ID&referrer=REFERRER">Click me!</a>

Because the href attribute is only written after the DOM is ready, robots won't find anything to follow.

Answered by cantlin

Your problem consists of 2 separate issues:

  1. multiple URLs lead to the same resource
  2. crawlers don't respect robots.txt

The second issue is hard to tackle; see Detecting 'stealth' web-crawlers.
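As a partial stop-gap you can at least ignore hits from the honest bots on the server side. Below is a minimal PHP sketch, not from the answer, that skips counting requests whose User-Agent identifies a crawler; looks_like_bot() and record_hit() are illustrative names, and stealth bots that fake a browser User-Agent will still slip through:

<?php
// Crude User-Agent check: catches crawlers that identify themselves,
// but not stealth bots that send a browser-like User-Agent.
function looks_like_bot()
{
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    return $ua === '' || preg_match('/bot|crawl|spider|slurp/i', $ua) === 1;
}

if (!looks_like_bot()) {
    record_hit((int) $_GET['id']); // record_hit() is a placeholder for your own logging
}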

The first one is easier. You seem to need an option to let the user go back to the previous page.

I'm not sure why you don't let the browser's history take care of this (via the back button and JavaScript's history.back()), but there are plenty of valid reasons not to.

Why not use the referrer header?
Almost all common browsers send information about the referring page with every request. It can be spoofed, but for the majority of visitors this should be a working solution.
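A minimal sketch of that idea in PHP, assuming action.php is supposed to send the user back once it has done its work (the header shows up as $_SERVER['HTTP_REFERER'] and may be absent or spoofed, hence the fallback):

<?php
// action.php: fall back to a safe default when no referrer was sent
$back = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '/';

// ... handle the action for $_GET['id'] here ...

// In production you would also check that $back points to your own site
// before redirecting, to avoid an open redirect.
header('Location: ' . $back);
exit;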

Why not use a cookie?
If you store the CURRENT_PAGE_URL in a cookie, you can still use a single unique URL for each page, still dynamically create breadcrumbs and back links based on the referrer stored in the cookie, and not be dependent on the HTTP referrer value.
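A sketch of the cookie variant, shown for both pages; the cookie name last_product_page is an illustrative assumption, not something from the answer:

<?php
// In product.php: remember the page the visitor is currently on
setcookie('last_product_page', '/product.php?id=' . (int) $_GET['id'], 0, '/');

// In action.php: read the cookie back and redirect, with a fallback
$back = isset($_COOKIE['last_product_page']) ? $_COOKIE['last_product_page'] : '/';
header('Location: ' . $back);
exit;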

Answered by Jacco