 

Cloudfront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, css and js files for my website using the custom origin option with subdomains CNAMEd to my account. It works pretty well.

Main site: www.mainsite.com

CloudFront CNAME domains:

  1. static1.mainsite.com
  2. static2.mainsite.com

Sample page: www.mainsite.com/summary/page1.htm

This page calls an image from static1.mainsite.com/images/image1.jpg

If CloudFront has not already cached the image, it gets the image from www.mainsite.com/images/image1.jpg

This all works fine.

The problem is that Google Alerts has reported the page as being found at both:

  • http://www.mainsite.com/summary/page1.htm
  • http://static1.mainsite.com/summary/page1.htm

The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.

I have tried putting a mod_rewrite rule in the .htaccess file, and I have also tried putting an exit() in the main script file.

But when CloudFront does not find the static1 version of the file in its cache, it requests it from the main site and then caches it.

My questions, then, are:

1. What am I missing here?
2. How do I prevent my site from serving full pages to CloudFront instead of just the static components?
3. How do I delete the pages from CloudFront? Just let them expire?

Thanks for your help.

Joe

asked Jan 06 '12 by Joe Boxer

People also ask

Can CloudFront have multiple origins?

You can configure a single CloudFront web distribution to serve different types of requests from multiple origins.

What is custom origin in CloudFront?

A custom origin is an HTTP server, for example, a web server. The HTTP server can be an Amazon EC2 instance or an HTTP server that you host somewhere else. An Amazon S3 origin configured as a website endpoint is also considered a custom origin.


2 Answers

[I know this thread is old, but I'm answering it for people like me who see it months later.]

From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.

1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.

2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.

3) Go to Behaviors and click Create Behavior. Set Path Pattern: robots.txt and Origin: (your new bucket).

4) Set the robots.txt behavior at a higher precedence (lower number).

5) Go to invalidations and invalidate /robots.txt.
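
For reference, steps 1 and 5 can also be done from the AWS CLI. A minimal sketch, assuming a hypothetical bucket name (my-cdn-robots) and a placeholder distribution ID (EDFDVBD6EXAMPLE):

    # Step 1: create the bucket and upload the CDN-only robots.txt
    aws s3 mb s3://my-cdn-robots
    aws s3 cp robots.txt s3://my-cdn-robots/robots.txt

    # Step 5: invalidate the cached copy so CloudFront picks up the new file
    aws cloudfront create-invalidation \
        --distribution-id EDFDVBD6EXAMPLE \
        --paths "/robots.txt"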

Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
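
For example, the robots.txt you put in that bucket could simply block all crawling of the CloudFront hostnames (a minimal sketch; adjust to taste):

    # robots.txt served only via CloudFront (static1/static2/abc123.cloudfront.net)
    # Blocks crawlers from indexing anything on those hostnames
    User-agent: *
    Disallow: /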

Another domain or subdomain would also work in place of a bucket, but why go to the trouble?

answered by Luke Lambert


You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.

In CloudFront you can control the hostname with which CloudFront will access your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect requests to that hostname and serve a robots.txt that disallows everything (unlike your regular website's robots.txt).
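
A minimal Apache mod_rewrite sketch of that idea, assuming CloudFront is configured to fetch from a hypothetical origin hostname cdn-origin.mainsite.com and that a separate robots_cdn.txt file exists at the web root:

    # .htaccess sketch: when the request comes in on the hostname only CloudFront
    # uses, serve a CDN-specific robots.txt instead of the regular one.
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^cdn-origin\.mainsite\.com$ [NC]
    RewriteRule ^robots\.txt$ /robots_cdn.txt [L]

Here robots_cdn.txt would hold the disallow-all rules, while the regular robots.txt stays untouched for www.mainsite.com.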

answered by Eran Sandler