Counting number of views for a page ignoring search engines?

I notice that StackOverflow has a view count for each question, and that these numbers are fairly low and accurate.

I have a similar thing on one of my sites. The backend code logs a "hit" whenever the page is loaded. Unfortunately it also does this for search engine visits, giving bloated and inaccurate numbers.

I guess one way to avoid counting a robot would be to do the view counting with an AJAX call once the page has loaded, but I'm sure there are other, better ways to ignore search engines in your hit counters whilst still letting them in to crawl your site. Do you know of any?

asked by David McLaughlin on Sep 05 '08

2 Answers

An AJAX call will do it, but usually search engines will not load images, JavaScript, or CSS files, so it may be easier to include one of those files in the page and pass the URL of the page you want to log a request against as a parameter in the file request.

For example, in the page...

http://www.example.com/example.html

you might include the following in the head section:

<link href="empty.css?log=example.html" rel="stylesheet" type="text/css" />

And have your server side log the request, then return an empty CSS file. The same approach would apply to a JavaScript or an image file, though in all cases you'll want to look carefully at what caching might take place.
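
Something like this could handle the logging side. It's a rough sketch assuming a Python/Flask backend; the route, query parameter name, and log file are just placeholders:

from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/empty.css")
def log_view():
    # The page being viewed arrives as ?log=<page> (see the <link> tag above).
    page = request.args.get("log", "unknown")
    with open("views.log", "a") as logfile:
        logfile.write(page + "\n")
    # Return an empty stylesheet and discourage caching so that
    # every page view triggers a fresh request.
    response = Response("", mimetype="text/css")
    response.headers["Cache-Control"] = "no-store"
    return response

You could then tally views per page by counting lines in the log, or write to a database instead of a flat file.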

Another option would be to eliminate the search engines based on their user agent. There's a big list of possible user agents at http://user-agents.org/ to get you started. Of course, you could go the other way, and only count requests from things you know are web browsers (covering IE, Firefox, Safari, Opera and this newfangled Chrome thing would get you 99% of the way there).
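
A rough Python sketch of the User-Agent approach (the bot substrings and the record_view helper are purely illustrative; a real deployment would use a fuller list like the one at user-agents.org):

# Hand-picked substrings of common crawler User-Agents.
KNOWN_BOTS = ("googlebot", "slurp", "msnbot", "baiduspider", "yandex")

def is_probably_a_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_BOTS)

# Only count the view when the request does not look like a crawler, e.g.:
#   if not is_probably_a_bot(request.headers.get("User-Agent", "")):
#       record_view(page)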

Even easier would be to use a log analytics tool like awstats or a service like Google analytics, both of which have already solved this problem.

answered by Matt Sheppard


To solve this problem I implemented a simple filter that would look at the User-Agent header in the HTTP request and compare it to a list of known robots.

I got the robot list from www.robotstxt.org. It's downloadable in a simple text format that can easily be parsed to auto-generate the "blacklist".
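
A rough Python sketch of that kind of filter; the file name and the "robot-useragent:" field name are assumptions about the downloaded text format, so adjust the parsing to whatever the file actually contains:

def load_robot_blacklist(path="robot_database.txt"):
    """Collect lowercase User-Agent strings from the downloaded robot list."""
    blacklist = set()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if line.lower().startswith("robot-useragent:"):
                agent = line.split(":", 1)[1].strip().lower()
                if agent:
                    blacklist.add(agent)
    return blacklist

def is_known_robot(user_agent, blacklist):
    """True if the request's User-Agent matches any blacklisted robot."""
    ua = user_agent.lower()
    return any(bot in ua for bot in blacklist)

The filter then skips the hit counter whenever is_known_robot() returns True, and counts the view otherwise.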

answered by Anders Sandvig