How to redirect crawlers requests to pre-rendered pages when using Amazon S3?

Problem

I have a static SPA built with Angular and hosted on Amazon S3. I'm trying to make my pre-rendered pages accessible to crawlers, but I can't redirect the crawlers' requests, since Amazon S3 does not offer a URL rewrite option and its redirect rules are limited.

What I have

I've added the following meta tag to the <head> of my index.html page:

<meta name="fragment" content="!">

Also, my SPA uses pretty URLs (without the hash # sign) via HTML5 pushState.
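
For completeness, the pretty URLs are enabled on the Angular side roughly like this (a minimal sketch; the module name is just illustrative, and recent AngularJS versions also need a <base href="/"> tag in index.html):

// app.js (illustrative module name)
angular.module('myApp', ['ngRoute'])
  .config(['$locationProvider', function ($locationProvider) {
    // enables HTML5 pushState URLs (no # in the address bar)
    $locationProvider.html5Mode(true);
  }]);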

With this setup, when a crawler finds my http://mywebsite.com/about link, it will make a GET request to http://mywebsite.com/about?_escaped_fragment_=. This is a pattern defined by Google and followed by other crawlers.

What I need is to answer this request with a pre-rendered version of the about.html file. I've already done this pre-rendering with PhantomJS, but I can't serve the correct file to crawlers because Amazon S3 does not have rewrite rules.

On an nginx server, the solution would be to add a rewrite rule like:

location / {
  if ($args ~ "_escaped_fragment_=") { 
    rewrite ^/(.*)$ /snapshots/$1.html break; 
  } 
} 

But on Amazon S3, I'm limited to its redirect rules, which are based on KeyPrefixes and HttpErrorCodes. The ?_escaped_fragment_= is not a KeyPrefix, since it appears at the end of the URL, and it produces no HTTP error, since Angular will simply ignore it.

What I've tried

I started by trying dynamic templates with ngRoute, but I later realized that no Angular-based solution can work here, since I'm targeting crawlers that can't execute JavaScript.

With Amazon S3, I have to stick to its redirect rules.

I've managed to get it working with an ugly workaround. If I create a new rule for each page, I'm done:

<RoutingRules>

  <!-- each page needs its own rule -->
  <RoutingRule>
    <Condition>
      <KeyPrefixEquals>about?_escaped_fragment_=</KeyPrefixEquals>
    </Condition>
    <Redirect>
      <HostName>mywebsite.com</HostName>
      <ReplaceKeyPrefixWith>snapshots/about.html</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>

</RoutingRules>

As you can see, with this solution each page needs its own rule. Since Amazon limits websites to only 50 redirect rules, this is not a viable solution.

Another solution would be to forget about pretty URLs and use hashbangs. With those, my link would be http://mywebsite.com/#!about and crawlers would request it as http://mywebsite.com/?_escaped_fragment_=about. Since the URL would start with ?_escaped_fragment_=, it could be captured with a KeyPrefix, and a single redirect rule would be enough. However, I don't want to use ugly URLs.

So, how can I host a static SPA on Amazon S3 and still be SEO-friendly?

Asked Sep 07 '15 by Zanon


1 Answer

Short Answer

Amazon S3 (and Amazon CloudFront) does not offer rewrite rules and has only limited redirect options. However, you don't need to redirect or rewrite your URL requests. Just pre-render all HTML files and upload them following your website paths.

Since a user browsing the webpage has JavaScript enabled, Angular will be triggered and will take control over the page, re-rendering the template. With this, all Angular functionality will be available to this user.

Regarding the crawler, the pre-rendered page will be enough.


Example

Suppose you have a website named www.myblog.com with a link to another page at www.myblog.com/posts/my-first-post. Your Angular app probably has the following structure: an index.html file in your root directory that is responsible for everything, while the page my-first-post is a partial HTML file located at /partials/my-first-post.html.

The solution in this case is to use a pre-rendering tool at deploy time. You can use PhantomJS for this, but you can't use a middleware tool like Prerender because you have a static site hosted on Amazon S3.

You need to use this pre-render tool to create two files: index.html and my-first-post. Note that my-first-post will be an HTML file without the .html extension, but you will need to set its Content-Type to text/html when you upload it to Amazon S3.

You will place the index.html file in your root directory and my-first-post inside a folder named posts to match your URL path /posts/my-first-post.
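
For example, a Node build script could upload the snapshot with the correct Content-Type using the AWS SDK (a rough sketch; the bucket name and local paths are placeholders, and your bucket policy still controls public access):

// upload-snapshot.js -- plain Node.js, run after prerendering
var fs = require('fs');
var AWS = require('aws-sdk');

var s3 = new AWS.S3();

s3.putObject({
  Bucket: 'mywebsite-bucket',                         // placeholder bucket name
  Key: 'posts/my-first-post',                         // matches the URL path, no .html extension
  Body: fs.readFileSync('snapshots/my-first-post'),   // placeholder local path
  ContentType: 'text/html'                            // required since the key has no extension to infer the type from
}, function (err) {
  if (err) throw err;
  console.log('Snapshot uploaded.');
});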

With this approach, the crawler will be able to retrieve your HTML file, and the user will be happy to use all of the Angular functionality.


Note: this solution requires that all files be referenced using the root path. Relative paths will not work if you visit the link www.myblog.com/posts/my-first-post.

By root path, I mean:

<script src="/js/myfile.js"></script>

The wrong way, using relative paths, would be:

<script src="js/myfile.js"></script>


EDIT:

Below is a small JavaScript snippet that I've used to prerender pages with PhantomJS. After installing PhantomJS and testing the script with a single page, add a script to your build process that prerenders all pages before deploying your site.

var fs = require('fs');
var webPage = require('webpage');
var page = webPage.create();

// since this tool will run before your production deploy, 
// your target URL will be your dev/staging environment (localhost, in this example)
var path = 'pages/my-page';
var url = 'http://localhost/' + path;

page.open(url, function (status) {

  if (status !== 'success') {
    throw 'Error trying to prerender ' + url;
  }

  // page.content holds the fully rendered HTML of the loaded page
  var content = page.content;
  fs.write(path, content, 'w');

  console.log("The file was saved.");
  phantom.exit();
});

Note: it looks like Node.js, but it isn't. It must be executed with the PhantomJS executable, not Node.
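
To prerender all pages at build time, the same idea can be extended with a loop over your routes. Below is a rough sketch under the same assumptions (a localhost dev server and output folders that already exist); the route list is only illustrative. Run it with the PhantomJS executable, e.g. phantomjs prerender-all.js (the file name is just an example).

var fs = require('fs');
var webPage = require('webpage');
var page = webPage.create();

// illustrative list of routes to snapshot -- replace with your own
var paths = ['about', 'posts/my-first-post', 'posts/my-second-post'];
var baseUrl = 'http://localhost/';

function renderNext() {
  if (paths.length === 0) {
    console.log('All pages were saved.');
    phantom.exit();
    return;
  }

  var path = paths.shift();

  page.open(baseUrl + path, function (status) {
    if (status !== 'success') {
      throw 'Error trying to prerender ' + baseUrl + path;
    }

    // if your app needs extra time to render, wrap the lines below
    // in a setTimeout to give Angular a chance to finish
    fs.write(path, page.content, 'w');   // assumes folders like posts/ already exist
    console.log('Saved ' + path);

    renderNext();                        // process the next route sequentially
  });
}

renderNext();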

Answered Sep 19 '22 by Zanon