Rails: dynamic robots.txt with erb

I'm trying to render a dynamic text file (robots.txt) in my Rails (3.0.10) app, but Rails keeps processing the request as HTML (according to the log).

Route:

match 'robots.txt' => 'sites#robots'

Controller:

class SitesController < ApplicationController

  respond_to :html, :js, :xml, :css, :txt

  def robots
    @site = Site.find_by_subdomain # blah blah
  end

end

app/views/sites/robots.txt.erb:

Sitemap: <%= @site.url %>/sitemap.xml

But when I visit http://www.example.com/robots.txt I get a blank page/source, and the log says:

Started GET "/robots.txt" for 127.0.0.1 at 2011-11-21 11:22:13 -0500
  Processing by SitesController#robots as HTML
  Site Load (0.4ms)  SELECT `sites`.* FROM `sites` WHERE (`sites`.`subdomain` = 'blah') ORDER BY created_at DESC LIMIT 1
Completed 406 Not Acceptable in 828ms

Any idea what I'm doing wrong?

Note: I added this to config/initializers/mime_types.rb, because Rails was complaining about not knowing what the .txt MIME type was:

Mime::Type.register_alias "text/plain", :txt

Note 2: I did remove the stock robots.txt from the public directory.

asked Nov 21 '11 by imderek


2 Answers

NOTE: This is a repost from coderwall.

Building on advice from a similar answer on Stack Overflow, I currently use the following solution to render a dynamic robots.txt based on the request's host.

Routing

# config/routes.rb
#
# Dynamic robots.txt
get 'robots.:format' => 'robots#index'
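
If you only ever want to serve the txt format, the route can also pin the format down so other extensions (e.g. /robots.json) never reach the controller. A variant, not part of the original answer, using standard Rails routing options:

# config/routes.rb
#
# Variant: match /robots.txt only and force the format explicitly
get 'robots.txt' => 'robots#index', defaults: { format: 'txt' }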

Controller

# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  # No layout
  layout false

  # Render a robots.txt file based on whether the request
  # was made against the canonical URL or not. This prevents
  # robots from indexing content served via a CDN twice.
  def index
    if canonical_host?
      render 'allow'
    else
      render 'disallow'
    end
  end

  private

  def canonical_host?
    request.host =~ /plugingeek\.com/
  end
end
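
If other MIME types are registered and template lookup ever resolves ambiguously, the render calls can be pinned down explicitly. A minimal sketch, assuming a Rails version where render accepts the :formats option (Rails 4+):

# Hypothetical variant of the index action that fixes
# the template format and the response content type
def index
  template = canonical_host? ? 'allow' : 'disallow'
  render template, formats: [:text], content_type: 'text/plain'
end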

Views

Based on request.host, we render one of two different .text.erb view files.

Allowing robots

# app/views/robots/allow.text.erb # Note the .text extension

# Allow robots to index the entire site except some specified routes
# rendered when the site is visited via the canonical hostname
# http://www.robotstxt.org/

# ALLOW ROBOTS
User-agent: *
Disallow:
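
Since these are plain ERB templates, dynamic values can be interpolated as well, e.g. a sitemap line like the one from the question. A hypothetical extension of the allow template; the sitemap path is an assumption:

# app/views/robots/allow.text.erb (hypothetical extension)
User-agent: *
Disallow:
Sitemap: <%= "https://#{request.host}/sitemap.xml" %>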

Banning spiders

# app/views/robots/disallow.text.erb # Note the .text extension

# Disallow robots from indexing any page on the site
# rendered when robot is visiting the site
# via the Cloudfront CDN URL
# to prevent duplicate indexing
# and search results referencing the Cloudfront URL

# DISALLOW ROBOTS
User-agent: *
Disallow: /

Specs

Testing the setup with RSpec and Capybara can be done quite easily, too.

# spec/features/robots_spec.rb
require 'spec_helper'

feature "Robots" do
  context "canonical host" do
    scenario "allow robots to index the site" do
      Capybara.app_host = 'http://www.plugingeek.com'
      visit '/robots.txt'
      Capybara.app_host = nil

      expect(page).to have_content('# ALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow:')
      expect(page).to have_no_content('Disallow: /')
    end
  end

  context "non-canonical host" do
    scenario "deny robots to index the site" do
      visit '/robots.txt'

      expect(page).to have_content('# DISALLOW ROBOTS')
      expect(page).to have_content('User-agent: *')
      expect(page).to have_content('Disallow: /')
    end
  end
end

# This would be the resulting docs
# Robots
#   canonical host
#      allow robots to index the site
#   non-canonical host
#      deny robots to index the site

As a last step, remember to remove the static robots.txt from the public folder if it's still present.

I hope you find this useful. Feel free to comment and help improve this technique even further.

answered Nov 16 '22 by Thomas Klemm


One solution that works in Rails 3.2.3 (not sure about 3.0.10) is as follows:

1) Name your template file robots.text.erb (note the .text extension, not .txt)

2) Set up your route like this: match '/robots.:format' => 'sites#robots'

3) Leave your action as is (you can remove the respond_to in the controller):

def robots
  @site = Site.find_by_subdomain # blah blah
end

This solution also eliminates the need to explicitly specify txt.erb in the render call mentioned in the accepted answer.
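
For completeness, the template itself can stay essentially what the question already had, just renamed with the .text extension. A sketch reusing the question's @site:

<%# app/views/sites/robots.text.erb %>
User-agent: *
Disallow:
Sitemap: <%= @site.url %>/sitemap.xml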

answered Nov 15 '22 by Nathan