TL;DR: Move one thing at a time and don’t leave the state behind!

Refactoring: form, not function

As with any refactoring, the focus is on the form, not the function. Your goal is to change the structure of the code while preserving its current behavior.

In the context of infrastructure configuration, preserving the current behavior means no change in the actual infrastructure resources. In the context of Terraform, that means ensuring your plan reports no changes (terraform plan -detailed-exitcode exits with 0 when the diff is empty).
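As a minimal sketch, the check could be wrapped in a small shell function. The name assert_no_changes is hypothetical; the exit codes are those Terraform documents for -detailed-exitcode (0 for an empty diff, 1 for an error, 2 for a non-empty diff).

```shell
# Hypothetical helper: succeed only when the plan is empty.
assert_no_changes() {
  status=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
  terraform plan -detailed-exitcode -input=false || status=$?
  case "$status" in
    0) echo "no changes: behavior preserved" ;;
    2) echo "plan has changes: the refactoring altered behavior" >&2
       return 2 ;;
    *) echo "terraform plan failed (exit $status)" >&2
       return 1 ;;
  esac
}
```

Calling assert_no_changes after each refactoring step (locally or in CI) turns "no change in behavior" into a pass/fail check.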

However, it is not always simple to avoid changes in behavior when refactoring Terraform configuration.

Terraform ate my infra

Let’s say you make a code change to rename a resource from bucket_test to log_storage. All Terraform knows is that bucket_test is gone and log_storage has appeared. For this reason, when asked to apply that change, it will faithfully destroy bucket_test and create a new log_storage.

Even though unintended (and potentially catastrophic), such changes in behavior are not rare: it’s easy to forget that the Terraform state is also part of your configuration.

To avoid the change in behavior, you have to make a corresponding change in the state with terraform state mv.
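For the rename above, the state change might look like the sketch below. The aws_s3_bucket resource type is an assumption made for illustration (the text only gives the names bucket_test and log_storage), and rename_in_state is a hypothetical helper.

```shell
# Hypothetical helper: move one resource address in the state.
# $1: old address, $2: new address.
rename_in_state() {
  # Preview first: -dry-run prints what would be moved
  # without writing to the state.
  terraform state mv -dry-run "$1" "$2" || return 1
  # Perform the actual move.
  terraform state mv "$1" "$2"
}

# Usage (resource type assumed for illustration):
#   rename_in_state aws_s3_bucket.bucket_test aws_s3_bucket.log_storage
```

After the move, Terraform sees the existing bucket under its new address, so the renamed resource in the code no longer looks like a delete plus a create.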

Waltzing with Terraform

The process becomes:

  1. Make a small change in the code’s structure.
  2. Make an equivalent change to the state’s structure.
  3. Ensure no change in behavior.
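
The three steps can be sketched as one repeatable shell function. This is only an illustration: refactor_step is a hypothetical name, step 1 (the code edit) still happens by hand, and the undo hint simply reverses the arguments of terraform state mv.

```shell
# Step 1 (the code change) is done by hand before calling this.
# $1: old address, $2: new address.
refactor_step() {
  old_addr="$1"
  new_addr="$2"

  # Step 2: mirror the code change in the state.
  terraform state mv "$old_addr" "$new_addr" || return 1

  # Step 3: an empty plan (exit 0) means behavior is preserved.
  status=0
  terraform plan -detailed-exitcode -input=false || status=$?
  if [ "$status" -ne 0 ]; then
    echo "plan is not clean (exit $status); to undo the move:" >&2
    echo "  terraform state mv $new_addr $old_addr" >&2
    return "$status"
  fi
  echo "step verified: $old_addr -> $new_addr"
}
```

Running one step per moved resource keeps each change small enough that a non-empty plan points directly at the move that caused it.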

The quirks of remote state

Things get a bit trickier when the configuration uses a remote state, which is often the case.

When refactoring with a remote state, you should also keep in mind that:

  1. there’s no traceability for the state change
  2. there’s no atomicity for the state and code change

A state change is integrated when the actual state is persisted with that change, and a code change is integrated by being merged to master and terraform apply-ed against the state (ideally as part of a CI pipeline).

When using a local state, traceability and atomicity can both be guaranteed by ensuring the code change and the state change are part of the same commit - the unit of integration. With a remote state, however, the state must be integrated first and the world must be stopped until the code change is integrated - state and code changes are coupled in time.

Anything using the Terraform configuration between the two integrations is working with an inconsistent system, and we know the consequence: unintended (and potentially catastrophic) behavior change! Say you integrate the state change, but some other commit is integrated before your code change. Terraform will try to undo your state change because its current code doesn’t know about your state refactoring.

Note the same can happen if you integrate multiple commits at once (i.e. you merge by rebasing, not squashing) and your CI triggers one build for each intermediate commit rather than only for the latest (been there, done that).

The lack of traceability can be addressed with communication. Make sure your commit message mentions it required manual changes to the state, for example. I usually include the terraform state commands in the commit message or the change request.
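
One way to keep that record accurate is to log each state command as it runs and build the commit message from the log. The sketch below is an assumption about workflow, not an established convention: the helper names and the state-changes.txt file name are hypothetical.

```shell
# Hypothetical log file for the state commands that were run.
STATE_LOG="${STATE_LOG:-state-changes.txt}"

# Run a state move and record it only if it succeeded.
logged_state_mv() {
  terraform state mv "$1" "$2" || return 1
  printf 'terraform state mv %s %s\n' "$1" "$2" >> "$STATE_LOG"
}

# Print a commit message: subject ($1) plus the recorded commands.
commit_message() {
  printf '%s\n\n' "$1"
  printf 'Manual state changes applied before this commit:\n\n'
  sed 's/^/    /' "$STATE_LOG"
}

# Usage: git reads the message from stdin with -F -
#   commit_message "Rename bucket_test to log_storage" | git commit -F -
```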

Atomicity can be achieved by ensuring exclusive access to the Terraform configuration during your refactoring, ideally for the shortest time possible. For this reason, it is important to keep your refactoring as focused (and small) as possible, ideally in a single commit.

Finally, keep in mind that lack of atomicity also makes reverting the code change less than trivial. That’s another good reason for using a storage backend that supports versioning.
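
As a complement to backend versioning (not a replacement for it), a local snapshot of the state can be taken before the refactoring starts. The sketch below uses terraform state pull; the helper name and the default file name are hypothetical.

```shell
# Hypothetical helper: save the current state to a local file.
# $1: destination file (default name is an assumption).
backup_state() {
  dest="${1:-pre-refactor.tfstate}"
  terraform state pull > "$dest" || return 1
  # Guard against an empty snapshot.
  [ -s "$dest" ] || { echo "backup is empty: $dest" >&2; return 1; }
  echo "state saved to $dest"
}

# Rolling back would mean pushing the snapshot again, e.g.
#   terraform state push pre-refactor.tfstate
# Terraform may refuse the push when the remote serial is higher
# than the snapshot's; read the state push docs before overriding.
```

Reverting then takes two coordinated steps, restoring the state and reverting the commit, which is the same coupling in time described above.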

Examples

TODO.