Purely JavaScript Solution for Google Ajax Crawlable Spec

I have a project which is heavily JavaScript based (e.g. Node.js, Backbone.js, etc.). I'm using hashbang URLs like /#!/about and have read the Google AJAX crawlable spec. I've done a wee bit of headless UI testing with zombie, and I can easily conceive of how this could be done by setting a slight delay and returning static content to the Googlebot. But I don't really want to implement this from scratch and was hoping there was a pre-existing library that fits in with my stack. Know of one?
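For reference, the spec maps /#!/about to a request for /?_escaped_fragment_=/about, so the hand-rolled version I'm picturing is roughly this Express sketch (renderSnapshot is a hypothetical placeholder for driving zombie against the page and grabbing the HTML after the delay):

var express = require('express');
var app = express();

// Googlebot asks for /?_escaped_fragment_=/about instead of /#!/about
app.use(function(req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) return next(); // normal browser traffic

  // renderSnapshot is hypothetical: load /#!<fragment> in a headless
  // browser, wait briefly for the AJAX to settle, return the HTML.
  renderSnapshot('/#!' + fragment, function(err, html) {
    if (err) return next(err);
    res.send(html);
  });
});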

EDIT: At the time of writing I don't think this exists. However, rendering with Backbone (or similar) on both the server and the client is a plausible approach (even if not a direct answer), so I'm going to mark that as the answer, although there may be better solutions in the future.

asked Jan 19 '12 by Rob


3 Answers

Just to chime in, I ran into this issue too (I have a very AJAX/JS-heavy site), and I found this, which may be of interest:

crawlme

I have yet to try it, but it sounds like it will make the whole process a piece of cake if it works as advertised! It's a piece of Connect/Express middleware that is simply inserted before any calls to pages, and apparently takes care of the rest.
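If it works the way the README suggests, usage would be something along these lines (I haven't verified this, so treat the exact call shape as an assumption):

var express = require('express');
var crawlme = require('crawlme'); // npm install crawlme

var app = express();

// Mount before the rest of the app so crawler requests
// (those carrying _escaped_fragment_) are intercepted first.
app.use(crawlme());
app.use(express.static(__dirname + '/public'));

app.listen(3000);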

Edit:

Having tried crawlme, I had some success, but the headless browser it uses on the backend (zombie.js) was failing with some of my JavaScript content, likely because it works by emulating the DOM and thus won't be perfect.

Sooo, instead I got hold of a full WebKit-based headless browser, phantomjs, plus a set of Node bindings for it, like this:

npm install phantomjs node-phantom

I then created my own script similar to crawlme, but using phantomjs instead of zombie.js. This approach seems to work perfectly, and will render every single one of my AJAX-based pages correctly. The script I wrote to pull this off can be found here. To use it, simply:

var googlebot = require("./path-to-file");

and then, before any other calls to your app (this is using Express, but it should work with plain Connect too):

app.use(googlebot());

The source is relatively simple minus a couple of regexps, so have a gander :)
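In essence it boils down to something like this (a simplified sketch from memory, not the actual source; double-check node-phantom's callback signatures against its docs):

var phantom = require('node-phantom');

module.exports = function googlebot() {
  return function(req, res, next) {
    var fragment = req.query._escaped_fragment_;
    if (fragment === undefined) return next(); // not a crawler request

    // Re-render the original hashbang URL in a real WebKit instance
    var url = 'http://' + req.headers.host + '/#!' + fragment;
    phantom.create(function(err, ph) {
      if (err) return next(err);
      ph.createPage(function(err, page) {
        if (err) return next(err);
        page.open(url, function(err, status) {
          if (err || status !== 'success') return next(err);
          // Give the client-side JS a moment to finish its AJAX calls
          setTimeout(function() {
            page.evaluate(function() {
              return document.documentElement.outerHTML;
            }, function(err, html) {
              ph.exit();
              if (err) return next(err);
              res.send(html);
            });
          }, 1000);
        });
      });
    });
  };
};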

Result: an AJAX-heavy Node.js/Connect/Express based website can be crawled by the Googlebot.

answered Nov 12 '22 by jsdw


There is one implementation using Node.js and Backbone.js on both the server and the browser: https://github.com/Morriz/backbone-everywhere
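The underlying trick is writing view/render code that runs in both environments, so the server can answer crawler requests with the same markup the client would build. A toy sketch of that pattern (my own illustration, not backbone-everywhere's actual API):

// views/about.js - one file loaded by Node (require) and the browser (script tag)
(function(exports) {
  exports.render = function(data) {
    return '<h1>About</h1><p>' + data.text + '</p>';
  };
})(typeof exports !== 'undefined' ? exports : (window.aboutView = {}));

On the server, require('./views/about').render(...) produces finished HTML for crawlers; in the browser, the same function updates the DOM after hash changes.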

answered Nov 12 '22 by opengrid


The crawlable Node.js module seems to fit this purpose: https://npmjs.org/package/crawlable, and there's an example of such a SPA that can be rendered server-side in Node: https://github.com/trupin/crawlable-todos

answered Nov 12 '22 by Guillaume Berche