
How to stop bots from crawling my AJAX-based URLs?

I've got several pages on my ASP.NET MVC 3 website (not that the technology matters here) where I render certain URLs in a <script> tag on the page, so that my JavaScript (stored in an external file) can make AJAX calls to the server.

Something like this:

<html>
   ...
   <body>
      ...
      <script type="text/javascript">
         $(function() {
            myapp.paths.someUrl = '/blah/foo'; // not hardcoded in reality, but N/A here
         });
      </script>
   </body>
</html>

Now on the server side, most of these URLs are protected with attributes stating that:

a) They can only be accessed by AJAX (e.g. XMLHttpRequest)

b) They can only be accessed by HTTP POST (as they return JSON, for security)
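For illustration only, the two rules above amount to a check like the sketch below. It is written in JavaScript rather than as the actual ASP.NET MVC attributes, and it assumes AJAX calls are recognised by the `X-Requested-With: XMLHttpRequest` header that jQuery adds to same-origin requests. `isAjaxPost` is a hypothetical name, not something from the real application.

```javascript
// Hypothetical helper (not from the original post): returns true only for
// requests that satisfy both rule (a) and rule (b).
function isAjaxPost(method, headers) {
  // Header names are case-insensitive, so normalise them before the lookup.
  const normalised = {};
  for (const [name, value] of Object.entries(headers)) {
    normalised[name.toLowerCase()] = value;
  }
  const isPost = method.toUpperCase() === "POST";
  // Assumption: AJAX requests carry the header jQuery sets by default.
  const isAjax = normalised["x-requested-with"] === "XMLHttpRequest";
  return isPost && isAjax;
}
```

A bot issuing a plain GET sends neither the POST method nor the header, so it fails this check.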

The problem is that, for some reason, bots are crawling these URLs and trying to do HTTP GETs on them, resulting in 404s.

I was under the impression that bots shouldn't try to crawl JavaScript. So how are they getting hold of these URLs?

Is there any way I can prevent them from doing this?

I can't really move these URL variables to an external file because, as the comment in the code above suggests, I render the URLs with server-side code (which must be done on the actual page).

I've basically been adding routes to my website that return HTTP 410 (Gone) for these URLs (when the request is not an AJAX POST). This is really annoying, because each one adds another route to my already convoluted route table.

Any tips/suggestions?

RPM1984 asked Mar 25 '12 23:03





1 Answer

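Disallow these URLs by their common path prefix in robots.txt. Well-behaved crawlers check that file and skip any path that starts with a disallowed prefix.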
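As a rough sketch, assuming the AJAX endpoints all live under a shared prefix such as `/blah/` (borrowed from the question's example URL, so treat it as hypothetical), the robots.txt could contain the two lines held in `robotsTxt` below. The helper functions are illustrative names only and implement a deliberately simplified prefix check: they ignore `Allow` rules, wildcards, and groups that list several user agents, so they are not a full robots.txt parser.

```javascript
// Assumed robots.txt content; "/blah/" is a hypothetical prefix.
const robotsTxt = [
  "User-agent: *",
  "Disallow: /blah/",
].join("\n");

// Collect the Disallow prefixes that apply to all user agents ("*").
function disallowedPrefixes(text) {
  const prefixes = [];
  let appliesToAll = false;
  for (const rawLine of text.split("\n")) {
    // Drop comments and surrounding whitespace.
    const line = rawLine.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      appliesToAll = value === "*";
    } else if (field === "disallow" && appliesToAll && value !== "") {
      prefixes.push(value);
    }
  }
  return prefixes;
}

// A path is off limits when it starts with any disallowed prefix.
function isDisallowed(path, text) {
  return disallowedPrefixes(text).some((prefix) => path.startsWith(prefix));
}
```

With these rules, `isDisallowed("/blah/foo", robotsTxt)` is true while a page outside the prefix, such as `/home`, stays crawlable. Note that robots.txt is only advisory: crawlers that ignore it will still need the server-side checks.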

Eugene Retunsky answered Oct 23 '22 13:10
