AWS CloudFront duplicate content issue: a solution

AWS CloudFront works like a proxy service: a request to CloudFront is served with content hosted on an origin server that you choose. In the process, CloudFront caches any files it has already served and returns the cached version for future requests.

AWS CloudFront process:

Request -> CloudFront -> Origin Server

Origin Server -> CloudFront -> Response
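
The request flow above can be sketched as a toy caching proxy. This is only a rough model: the names EdgeCache and fake_origin are made up for illustration, and it ignores everything a real CloudFront edge does around cache headers, expiry and invalidation.

```python
from typing import Callable, Dict, List


class EdgeCache:
    """Toy model of a caching proxy sitting in front of an origin server."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch_from_origin = fetch_from_origin
        self._cache: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        # Cache hit: answer without contacting the origin.
        if path in self._cache:
            return self._cache[path]
        # Cache miss: forward to the origin, then remember the response.
        body = self._fetch_from_origin(path)
        self._cache[path] = body
        return body


# Hypothetical origin that records every request it receives.
origin_hits: List[str] = []


def fake_origin(path: str) -> bytes:
    origin_hits.append(path)
    return f"content of {path}".encode()


edge = EdgeCache(fake_origin)
edge.get("/css/site.css")  # miss: forwarded to the origin
edge.get("/css/site.css")  # hit: served from the cache
print(len(origin_hits))    # number of requests that reached the origin
```

The second request never reaches the origin, which is also why a file that is already cached can keep being served for a while after you change the distribution settings.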

 

Other solutions found online suggest using robots.txt, but that means making changes to your site, and it also blocks access to CSS and JS files. Googlebot now behaves like a modern browser and needs access to CSS files (for example, to detect whether your site is responsive/mobile friendly).
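
The robots.txt drawback can be checked locally with Python's standard urllib.robotparser module. The robots.txt content and the CSS URL below are assumptions chosen for illustration (a blanket "Disallow: /" on the CloudFront hostname). The parser only approximates how a real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks everything on the CloudFront hostname.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical stylesheet URL on the distribution.
css_url = "https://XXXX.cloudfront.net/css/site.css"

# A blanket Disallow also covers stylesheets, not only HTML pages.
css_blocked = not parser.can_fetch("Googlebot", css_url)
print(css_blocked)
```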

This solution assumes that you only want to serve static files such as CSS, JS and image files.

You can use CloudFront to cache a complete domain of your choosing, but this creates a duplicate of all the content served by the origin server.

For this example, suppose you have the domain static.example.com and serve all of its content from XXXX.cloudfront.net. This is bad for SEO, because search engines see the same content on both hostnames and treat it as duplicate content.

To get around the issue: once you have created your CloudFront distribution, go to the “Origins” tab and add a new origin pointing to a domain that does not exist, say “non-existent.example.com”.

Because this origin points to a domain that doesn’t exist, CloudFront cannot serve any request that is routed to it.

Now go to the “Behaviors” tab, edit the default behavior and set its origin to the non-existent one. After this, all requests to XXXX.cloudfront.net should return an error (provided the content is not still cached by edge locations).

Now create a new behavior with “Path Pattern” set to something like *.css for CSS files and set its origin to static.example.com. Repeat this step for every path pattern that you actually want to resolve to a successful request.

The above setup ensures that your distribution only serves the path patterns you have included.
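
The routing that results from this setup can be modelled in a few lines of Python. This is a simplified local sketch, not the CloudFront API: pick_origin is a made-up helper, the *.js and *.png patterns are assumed examples alongside the *.css one, and fnmatchcase only approximates CloudFront's wildcard matching (it also understands [...] sets, for instance).

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# (path pattern, origin) pairs checked in order; only *.css comes from the
# text above, the other two patterns are assumed examples.
BEHAVIORS: List[Tuple[str, str]] = [
    ("*.css", "static.example.com"),
    ("*.js", "static.example.com"),
    ("*.png", "static.example.com"),
]

# The default behavior points at the origin that does not exist.
DEFAULT_ORIGIN = "non-existent.example.com"


def pick_origin(path: str) -> str:
    """Return the origin that would handle `path` in this simplified model."""
    candidate = path.lstrip("/")
    for pattern, origin in BEHAVIORS:
        # Case-sensitive wildcard match; '*' also spans '/' here.
        if fnmatchcase(candidate, pattern.lstrip("/")):
            return origin
    # Nothing matched: the request falls through to the default behavior.
    return DEFAULT_ORIGIN


for path in ("/css/site.css", "/js/app.js", "/index.html"):
    print(path, "->", pick_origin(path))
```

Only the whitelisted file types reach static.example.com. Everything else, including HTML pages, ends up at the non-existent origin and fails.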

 

Problems faced configuring LAMP

1. Every virtual host needs to use the private IP address in the VirtualHost directive rather than the Elastic IP, or it will not work.

i.e.

Good

<VirtualHost 10.202.150.134:80>

Not

<VirtualHost *:80>

<VirtualHost 0.0.0.0:80>

<VirtualHost 184.72.230.132:80>

2. “Access denied for user 'www-data'@'localhost'” shows up if the first attempt to connect to the database has failed and the application then calls mysql_query. PHP finds no active connection, so it tries to connect to the database with default settings. This is quite hard to fix.
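
For the first point, a small check can catch a wrong address before it ends up in the Apache configuration. The sketch below uses Python's standard ipaddress module. virtual_host_line is a made-up helper that simply encodes the rule stated in point 1 (private address only), using the addresses shown there.

```python
import ipaddress


def virtual_host_line(address: str, port: int = 80) -> str:
    """Build a <VirtualHost> opening line, refusing non-private addresses."""
    # Raises ValueError for anything that is not an IP address, such as "*".
    ip = ipaddress.ip_address(address)
    # Reject public addresses (e.g. an Elastic IP), 0.0.0.0 and loopback.
    if not ip.is_private or ip.is_unspecified or ip.is_loopback:
        raise ValueError(f"{address} is not a usable private address")
    return f"<VirtualHost {address}:{port}>"


print(virtual_host_line("10.202.150.134"))

for bad in ("*", "0.0.0.0", "184.72.230.132"):
    try:
        virtual_host_line(bad)
    except ValueError as exc:
        print("rejected:", exc)
```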