 

How to collect page views while excluding bots and crawlers in 2016?

We want to add page view counters to our article pages (just like on Stack Overflow), but we don't want to count page views from bots and crawlers.

I searched quite a bit and only found very outdated answers that say to fire an AJAX request, since crawlers and bots don't execute JavaScript... Well, it's 2016, and I believe all the major crawlers execute JavaScript nowadays.

I thought about two viable solutions:

  1. Keep a list of all known bot and crawler user agents on the server, and only increase the counter when the request doesn't come from one of them (a rough sketch of this follows the list below). This seems like a very bad solution, since the list needs to be maintained and updated regularly, and many bots will probably slip through it anyway.
  2. Use AJAX to send a request to an endpoint that is disallowed in robots.txt (or use a hidden image with src="/article/track/?id=xxxxx").
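
For reference, a naive sketch of option 1 might look like the following (the token list here is made up and nowhere near complete, which is exactly the maintenance problem; incrementViewCounter is a placeholder):

// Hypothetical, partial list of user-agent tokens; real lists are much longer and change constantly.
const BOT_UA_TOKENS = ['bot', 'crawler', 'spider', 'slurp', 'baiduspider', 'yandex'];

function looksLikeBot(userAgent) {
    const ua = (userAgent || '').toLowerCase();
    return BOT_UA_TOKENS.some(token => ua.includes(token));
}

// Usage in the request pipeline:
// if (!looksLikeBot(req.headers['user-agent'])) incrementViewCounter(articleId);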

The second option creates an extra request per page, which isn't horrible, but maybe there's a better way? What is the common way of handling this today?

I'm using ASP.NET Core and storing the page views in Redis, if it matters.

asked Oct 12 '16 by gdoron is supporting Monica

2 Answers

I found out how Stack Overflow itself handles it:

<script>
    StackExchange.ready(function(){$.get('/posts/40008735/ivc/e079');});
</script>
<noscript>
    <div>
        <img src="/posts/40008735/ivc/e079" class="dno" alt="" width="0" height="0">
    </div>
</noscript>

And in robots.txt:

Disallow: /*/ivc/*
...
User-agent: Googlebot-Image
Disallow: /*/ivc/*

So basically, they handle it as I suggested in option 2:

Issue an AJAX request (or fall back to a hidden img when JavaScript is disabled) and instruct crawlers and bots not to crawl that URL with a Disallow rule.
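
For completeness, the server side of such a tracked endpoint only needs to bump a counter and return an empty, non-cacheable response. Here is a rough sketch using Node/Express and ioredis purely for illustration (the question's stack is ASP.NET Core, and the route and key names are assumptions):

const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis(); // connects to localhost:6379 by default

// The endpoint that robots.txt disallows (route and key names are made up).
app.get('/article/track/:id', async (req, res) => {
    await redis.incr(`views:article:${req.params.id}`);
    res.set('Cache-Control', 'no-store'); // so repeat visits still hit the endpoint
    res.status(204).end();                // no body needed for the AJAX/img beacon
});

app.listen(3000);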

answered Oct 20 '22 by gdoron is supporting Monica


As I mentioned in chat, you could cache the IP address of the client when it requests /robots.txt.

On other requests, check if the IP address is in the cache and don't count it as a page view if it is.
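
A rough sketch of the idea with Express and an in-memory set, purely for illustration (the helper name is made up, and a shared cache with an expiry would be more robust in practice):

const express = require('express');
const app = express();

// IPs that have fetched robots.txt; almost certainly bots or crawlers.
// An in-memory Set is only for illustration: use a shared cache with a TTL in practice.
const robotsRequesters = new Set();

app.get('/robots.txt', (req, res) => {
    robotsRequesters.add(req.ip);
    res.type('text/plain').send('User-agent: *\nDisallow:'); // serve your real robots.txt here
});

app.get('/article/:id', (req, res) => {
    if (!robotsRequesters.has(req.ip)) {
        incrementViewCounter(req.params.id); // count only clients that never asked for robots.txt
    }
    res.send('article page goes here');
});

// Placeholder so the sketch is self-contained (e.g. a Redis INCR in the real app).
function incrementViewCounter(articleId) { /* redis.incr(`views:article:${articleId}`) */ }

app.listen(3000);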

answered Oct 20 '22 by Oliver Salzburg