Upgrading the instance class on an Aurora cluster with Terraform

We recently decided to upgrade one of our Aurora RDS clusters from db.r3.large to db.r4.large. Our entire environment is managed by Terraform, but it was not clear from the Terraform documentation what would happen if we simply changed the instance class and applied the change. Would Terraform be smart enough to upgrade the writer instance in AZ1, failing over to the reader in AZ2, and then, once that was complete, upgrade the newly promoted writer, failing back to the upgraded instance in AZ1?

It turns out that it is not that smart.

We built an experimental cluster to try out some different scenarios. First, we tried simply changing the instance_class in the aws_rds_cluster_instance resource and applying the Terraform changes. Terraform proceeded to update both instances at the same time. The Aurora cluster tried to fail over, but when both instances are going down at once, there is nothing left to fail over to. We experienced about 11 minutes of downtime with this approach.

Our second approach was a little more successful. We edited the Terraform file to change the instance_class, but we did not apply the changes. Instead, we went into the AWS console and manually changed the writer instance’s class to db.r4.large. Aurora dutifully failed over to the reader in AZ2 before bringing down the writer in AZ1. This caused about 45 seconds of downtime. We waited for the upgraded instance to become fully ready, and then we manually updated the newly promoted writer instance in AZ2. This failed us back to the upgraded instance in AZ1 while the AZ2 instance restarted with the new instance class. This caused another 30 seconds of downtime.

You could reduce the downtime a little bit if you updated the reader first and then updated the writer, failing over from AZ1 to AZ2 and then leaving the writer in AZ2. This technique would require only a single failover. If your dual-AZ implementation is solid, it really shouldn’t matter which AZ your writer runs in.
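
The reader-first ordering described above could also be scripted rather than clicked through in the console. What follows is only a rough sketch in Python, not something we ran against our cluster. It assumes a boto3 RDS client is passed in as rds, and the function names plan_upgrade_order and upgrade_cluster are made up for illustration. It also assumes, as we observed in the console, that Aurora fails over on its own when the writer is modified.

```python
def plan_upgrade_order(members):
    """Return instance identifiers with readers first and the writer last.

    `members` has the shape of the DBClusterMembers list returned by
    describe_db_clusters: dicts with DBInstanceIdentifier and IsClusterWriter.
    """
    readers = [m["DBInstanceIdentifier"] for m in members if not m["IsClusterWriter"]]
    writers = [m["DBInstanceIdentifier"] for m in members if m["IsClusterWriter"]]
    return readers + writers


def upgrade_cluster(rds, cluster_id, new_class):
    """Resize one instance at a time, waiting for each before starting the next.

    `rds` is assumed to be a boto3 RDS client (or anything with the same methods).
    """
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    order = plan_upgrade_order(cluster["DBClusterMembers"])
    waiter = rds.get_waiter("db_instance_available")
    for instance_id in order:
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            DBInstanceClass=new_class,
            ApplyImmediately=True,
        )
        # Caveat: the instance can still report "available" for a moment after
        # the modify call returns, so a real script would probably want to
        # confirm the status has changed before trusting this wait.
        waiter.wait(DBInstanceIdentifier=instance_id)
    return order
```

With real credentials, this would be called along the lines of upgrade_cluster(boto3.client("rds"), "my-cluster", "db.r4.large"), where "my-cluster" is a placeholder identifier. Because the readers are resized before the writer, only the last step should trigger a failover, and the writer ends up in the other AZ.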

It would be nice if Terraform were smart enough to follow a similar process when changing the instance_class of the RDS cluster instances. I’m not familiar enough with the internals of Terraform to know whether such intelligent behavior is realistically within the realm of possibility. Thankfully, we don’t do a lot of these kinds of operations, and this time we were aware enough to realize that Terraform might not handle the situation gracefully.
