 

Cloudfront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, css and js files for my website using the custom origin option with subdomains CNAMEd to my account. It works pretty well.

Main site: www.mainsite.com

CloudFront CNAME domains:

  1. static1.mainsite.com
  2. static2.mainsite.com

Sample page: www.mainsite.com/summary/page1.htm

This page calls an image from static1.mainsite.com/images/image1.jpg

If CloudFront has not already cached the image, it gets the image from www.mainsite.com/images/image1.jpg

This all works fine.

The problem is that Google Alerts has reported the page as being found at both:

  • http://www.mainsite.com/summary/page1.htm
  • http://static1.mainsite.com/summary/page1.htm

The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.

I have tried putting a mod_rewrite rule in the .htaccess file, and I have also tried putting an exit() in the main script file.

But when CloudFront does not find the static1 version of the file in its cache, it requests it from the main site and then caches it.

My questions, then, are:

1. What am I missing here?
2. How do I prevent my site from serving full pages to CloudFront instead of just the static components?
3. How do I delete the pages from CloudFront? Just let them expire?

Thanks for your help.

Joe

asked Jan 06 '12 by Joe Boxer

People also ask

Can CloudFront have multiple origins?

You can configure a single CloudFront web distribution to serve different types of requests from multiple origins.

What is custom origin in CloudFront?

A custom origin is an HTTP server, for example, a web server. The HTTP server can be an Amazon EC2 instance or an HTTP server that you host somewhere else. An Amazon S3 origin configured as a website endpoint is also considered a custom origin.


2 Answers

[I know this thread is old, but I'm answering it for people like me who see it months later.]

From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.

1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.

2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.

3) Go to Behaviors and click Create Behavior. Set Path Pattern: robots.txt and Origin: (your new bucket).

4) Set the robots.txt behavior at a higher precedence (lower number).

5) Go to invalidations and invalidate /robots.txt.
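
For reference, steps 1 and 5 can also be done from the AWS CLI. A minimal sketch, assuming a hypothetical bucket name (my-cdn-robots) and a placeholder distribution ID (EDFDVBD6EXAMPLE):

    # Step 1: create the bucket and upload the CDN-only robots.txt
    aws s3 mb s3://my-cdn-robots
    aws s3 cp robots.txt s3://my-cdn-robots/robots.txt

    # Step 5: invalidate the cached copy so CloudFront picks up the new file
    aws cloudfront create-invalidation \
        --distribution-id EDFDVBD6EXAMPLE \
        --paths "/robots.txt"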

Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
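
For example, the robots.txt you put in that bucket could simply block all crawling of the CloudFront hostnames (a minimal sketch; adjust to taste):

    # robots.txt served only via CloudFront (static1/static2/abc123.cloudfront.net)
    # Blocks crawlers from indexing anything on those hostnames
    User-agent: *
    Disallow: /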

Another domain or subdomain would also work in place of a bucket, but why go to the trouble?

answered by Luke Lambert


You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.

In CloudFront you can control the hostname with which CloudFront will access your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect requests to that hostname and serve a robots.txt that disallows everything (unlike your regular website's robots.txt).
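
A minimal Apache mod_rewrite sketch of that idea, assuming CloudFront is configured to fetch from a hypothetical origin hostname cdn-origin.mainsite.com and that a separate robots_cdn.txt file exists at the web root:

    # .htaccess sketch: when the request comes in on the hostname only CloudFront
    # uses, serve a CDN-specific robots.txt instead of the regular one.
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^cdn-origin\.mainsite\.com$ [NC]
    RewriteRule ^robots\.txt$ /robots_cdn.txt [L]

Here robots_cdn.txt would hold the disallow-all rules, while the regular robots.txt stays untouched for www.mainsite.com.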

answered by Eran Sandler