While Terraform backends provide locking for Terraform state, they cannot help you with locking at the level of the Terraform code itself. In particular, if two team members are deploying the same code to the same environment, but from different branches, you'll run into conflicts that locking can't prevent.
There are four types of issues you can experience with Terraform: language, state, core, and provider errors. Starting from the type of error closest to the user: language errors arise in the HashiCorp Configuration Language (HCL), the declarative configuration language that is Terraform's primary interface.
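As a practical note (standard CLI commands), language errors are the cheapest to catch, since they can be surfaced locally before anything touches state or a provider:

# Parse the HCL in the current directory and check it for internal consistency
terraform validate

# Optionally verify canonical formatting at the same time
terraform fmt -check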
I am also in the process of migrating existing AWS infrastructure to Terraform, so I shall aim to update this answer as I develop.
I have been relying heavily on the official Terraform examples and a lot of trial and error to flesh out the areas I was uncertain about.
.tfstate files
Terraform config can be used to provision many boxes on different infrastructure, each of which could have a different state. As it can also be run by multiple people, this state should live in a centralised location (like S3), not in git.
This can be confirmed by looking at the Terraform .gitignore.
Developer control
Our aim is to give developers more control of the infrastructure whilst maintaining a full audit trail (git log) and the ability to sanity-check changes (pull requests). With that in mind, the new infrastructure workflow I am aiming towards keeps all Terraform code in git and puts every change through a pull request before it is applied.
Edit 1 - Update on current state
Since starting this answer I have written a lot of TF code and feel more comfortable with where we stand. We have hit bugs and restrictions along the way, but I accept that this is a characteristic of using new, rapidly changing software.
Layout
We have a complicated AWS infrastructure with multiple VPCs, each with multiple subnets. Key to managing this easily was defining a flexible taxonomy encompassing region, environment, service and owner, which we use to organise our infrastructure code (both Terraform and Puppet).
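As an illustrative sketch (using current HCL syntax; all names and values here are hypothetical), such a taxonomy can be encoded as a common tags map that every resource applies:

variable "common_tags" {
  description = "Taxonomy tags: region, environment, service and owner"
  type        = map(string)
  default = {
    region      = "eu-west-1"
    environment = "dev"
    service     = "webapp"
    owner       = "platform"
  }
}

# Any taggable resource then carries the taxonomy uniformly:
resource "aws_s3_bucket" "example" {
  bucket = "webapp-dev-artifacts"   # placeholder name
  tags   = var.common_tags
}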
Modules
The next step was to create a single git repository to store our Terraform modules. Our top-level directory structure for the modules looks like this:
tree -L 1 .
Result:
├── README.md
├── aws-asg
├── aws-ec2
├── aws-elb
├── aws-rds
├── aws-sg
├── aws-vpc
└── templates
Each one sets some sane defaults but exposes them as variables that can be overridden by our "glue".
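For example, the aws-vpc module might look something like this (a minimal sketch using current HCL syntax; the variable name and default are illustrative):

# aws-vpc/variables.tf
variable "cidr_block" {
  description = "VPC CIDR range"
  default     = "10.0.0.0/16"   # sane default that the glue can override
}

# aws-vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}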
Glue
We have a second repository with our glue that makes use of the modules mentioned above. It is laid out in line with our taxonomy document:
.
├── README.md
├── clientA
│ ├── eu-west-1
│ │ └── dev
│ └── us-east-1
│ └── dev
├── clientB
│ ├── eu-west-1
│ │ ├── dev
│ │ ├── ec2-keys.tf
│ │ ├── prod
│ │ └── terraform.tfstate
│ ├── iam.tf
│ ├── terraform.tfstate
│ └── terraform.tfstate.backup
└── clientC
├── eu-west-1
│ ├── aws.tf
│ ├── dev
│ ├── iam-roles.tf
│ ├── ec2-keys.tf
│ ├── prod
│ ├── stg
│ └── terraform.tfstate
└── iam.tf
Inside the client level we have AWS account-specific .tf files that provision global resources (like IAM roles); next is the region level, holding EC2 SSH public keys; finally, each environment (dev, stg, prod, etc.) holds the VPC setup, instance creation, peering connections and so on.
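For instance, the region-level ec2-keys.tf needs little more than key pair resources (a sketch; the key material is a placeholder):

resource "aws_key_pair" "deployer" {
  key_name   = "deployer"
  public_key = "ssh-rsa AAAAB3Nza... deployer@example.com"   # placeholder public key
}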
Side note: as you can see, I'm going against my own advice above by keeping terraform.tfstate in git. This is a temporary measure until I move to S3, but it suits me as I'm currently the only developer.
Next Steps
This is still a manual process and not in Jenkins yet, but we're porting a rather large, complicated infrastructure and so far so good. Like I said, a few bugs, but it's going well!
Edit 2 - Changes
It's been almost a year since I wrote this initial answer, and the state of both Terraform and myself has changed significantly. I am now at a new position using Terraform to manage an Azure cluster, and Terraform is now at v0.10.7.
State
People have repeatedly told me state should not go in git, and they are correct. We used git as an interim measure with a two-person team that relied on developer communication and discipline. With a larger, distributed team, we are now fully leveraging remote state in S3 with locking provided by DynamoDB. Ideally this will be migrated to Consul, now that it has reached v1.0, so the same approach works across cloud providers.
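A minimal sketch of that backend configuration with current syntax (the bucket and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"                      # placeholder bucket
    key            = "clientA/eu-west-1/dev/terraform.tfstate" # one key per environment
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"                         # placeholder lock table
    encrypt        = true
  }
}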
Modules
Previously we created and used only internal modules. This is still the case, but with the advent and growth of the Terraform Registry we now try to use registry modules as at least a base.
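For example, consuming a community module from the registry with a pinned version (the module shown is the public AWS VPC module; the inputs are illustrative):

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.14.0"   # pin a known release so upgrades are deliberate

  name = "dev"
  cidr = "10.0.0.0/16"
}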
File structure
The new position has a much simpler taxonomy with only two infrastructure environments: dev and prod. Each has its own variables and outputs, reusing the modules created above. The terraform_remote_state data source also helps in sharing outputs of created resources between environments (see the sketch after the layout below). Our scenario is subdomains in different Azure resource groups under a globally managed TLD.
├── main.tf
├── dev
│ ├── main.tf
│ ├── output.tf
│ └── variables.tf
└── prod
├── main.tf
├── output.tf
└── variables.tf
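Here is a sketch of how one environment can read another state's outputs through terraform_remote_state (backend details and the output name are placeholders):

data "terraform_remote_state" "global" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"        # placeholder bucket
    key    = "global/terraform.tfstate"  # the state that exports the shared outputs
    region = "eu-west-1"
  }
}

# A hypothetical shared output, e.g. the globally managed DNS zone:
# data.terraform_remote_state.global.outputs.dns_zone_name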
Planning
Again, with the extra challenges of a distributed team, we now always save the output of the terraform plan command. We can inspect it and know exactly what will be applied, without the risk of changes sneaking in between the plan and apply stages (although locking helps with this). Remember to delete the plan file afterwards, as it can contain plain-text "secret" variables.
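The workflow looks like this (a sketch of the standard plan/apply/cleanup commands):

# Write the plan to a file so everyone can review exactly what will change
terraform plan -out=tfplan

# Apply exactly that plan; Terraform refuses to apply it if the state has changed since
terraform apply tfplan

# Clean up: the plan file can embed sensitive variable values in plain text
rm tfplan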
Overall we are very happy with Terraform and continue to learn and improve with the new features added.
We use Terraform heavily and our recommended setup is as follows:
We highly recommend storing the Terraform code for each of your environments (e.g. stage, prod, qa) in separate sets of templates (and therefore, separate .tfstate files). This is important so that your separate environments are actually isolated from each other while making changes. Otherwise, while messing around with some code in staging, it's too easy to blow up something in prod too. See Terraform, VPC, and why you want a tfstate file per env for a colorful discussion of why.
Therefore, our typical file layout looks like this:
stage
└ main.tf
└ vars.tf
└ outputs.tf
prod
└ main.tf
└ vars.tf
└ outputs.tf
global
└ main.tf
└ vars.tf
└ outputs.tf
All the Terraform code for the stage VPC goes into the stage folder, all the code for the prod VPC goes into the prod folder, and all the code that lives outside of a VPC (e.g. IAM users, SNS topics, S3 buckets) goes into the global folder.
Note that, by convention, we typically break our Terraform code down into 3 files:
vars.tf: Input variables.
outputs.tf: Output variables.
main.tf: The actual resources.
Typically, we define our infrastructure in two folders:
infrastructure-modules: This folder contains small, reusable, versioned modules. Think of each module as a blueprint for how to create a single piece of infrastructure, such as a VPC or a database.
infrastructure-live: This folder contains the actual live, running infrastructure, which it creates by combining the modules in infrastructure-modules. Think of the code in this folder as the actual houses you built from your blueprints.
A Terraform module is just any set of Terraform templates in a folder. For example, we might have a folder called vpc in infrastructure-modules that defines all the route tables, subnets, gateways, ACLs, etc. for a single VPC:
infrastructure-modules
└ vpc
└ main.tf
└ vars.tf
└ outputs.tf
We can then use that module in infrastructure-live/stage and infrastructure-live/prod to create the stage and prod VPCs. For example, here is what infrastructure-live/stage/main.tf might look like:
module "stage_vpc" {
source = "git::[email protected]:gruntwork-io/module-vpc.git//modules/vpc-app?ref=v0.0.4"
vpc_name = "stage"
aws_region = "us-east-1"
num_nat_gateways = 3
cidr_block = "10.2.0.0/18"
}
To use a module, you create a module block and point its source field either to a local path on your hard drive (e.g. source = "../infrastructure-modules/vpc") or, as in the example above, to a Git URL (see module sources). The advantage of the Git URL is that we can specify a specific git sha1 or tag (ref=v0.0.4). Now, not only do we define our infrastructure as a bunch of small modules, but we can also version those modules and carefully update or roll back as needed.
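For comparison, here is a sketch of consuming the same module from a local path during development (the inputs mirror the Git example above):

module "stage_vpc" {
  # Local path: fast iteration while developing the module, but no version pinning
  source = "../infrastructure-modules/vpc"

  vpc_name         = "stage"
  aws_region       = "us-east-1"
  num_nat_gateways = 3
  cidr_block       = "10.2.0.0/18"
}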
We've created a number of reusable, tested, and documented Infrastructure Packages for creating VPCs, Docker clusters, databases, and so on, and under the hood, most of them are just versioned Terraform modules.
When you use Terraform to create resources (e.g. EC2 instances, databases, VPCs), it records information on what it created in a .tfstate file. To make changes to those resources, everyone on your team needs access to this same .tfstate file, but you should NOT check it into Git (see here for an explanation why).
Instead, we recommend storing .tfstate files in S3 by enabling Terraform Remote State, which will automatically push/pull the latest files every time you run Terraform. Make sure to enable versioning in your S3 bucket so you can roll back to older .tfstate files in case you somehow corrupt the latest version. However, an important note: Terraform doesn't provide locking. So if two team members run terraform apply at the same time on the same .tfstate file, they may end up overwriting each other's changes.
Edit 2020: Terraform now supports locking: https://www.terraform.io/docs/state/locking.html
To solve this problem, we created an open source tool called Terragrunt, which is a thin wrapper for Terraform that uses Amazon DynamoDB to provide locking (which should be completely free for most teams). Check out Add Automatic Remote State Locking and Configuration to Terraform with Terragrunt for more info.
We've just started a series of blog posts called A Comprehensive Guide to Terraform that describes in detail all the best practices we've learned for using Terraform in the real world.
Update: the Comprehensive Guide to Terraform blog post series got so popular that we expanded it into a book called Terraform: Up & Running!
Previously, remote config allowed this, but it has since been replaced by "backends", so the terraform remote commands shown below are no longer available:
terraform remote config -backend-config="bucket=<s3_bucket_to_store_tfstate>" -backend-config="key=terraform.tfstate" -backend=s3
terraform remote pull
terraform apply
terraform remote push
See the docs for details.
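With backends, the equivalent setup is a backend block in the configuration plus terraform init (a sketch; the region is a placeholder):

terraform {
  backend "s3" {
    bucket = "<s3_bucket_to_store_tfstate>"
    key    = "terraform.tfstate"
    region = "us-east-1"   # placeholder region
  }
}

Then run:

terraform init    # configures the backend and offers to migrate any existing local state
terraform apply   # state is pushed/pulled automatically; no explicit remote pull/push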
Covered in more depth by @Yevgeny Brikman but specifically answering the OP's questions:
What's the best practice for actually managing the terraform files and state?
Use git for the TF files, but don't check state files (i.e. .tfstate) in. Instead, use Terragrunt for syncing and locking the state files in S3.
but do I commit tfstate as well?
No.
Should that reside somewhere like S3?
Yes.