Wednesday, January 23, 2013

VPC Migration: NATs & Bandwidth Bottleneck

I ran into an unexpected issue during the migration to VPC over the weekend. The NAT instances, all of which are t1.micro size, could not handle the network traffic between the web servers and the backend servers. Our traffic backed up to the point that requests started timing out. The disastrous result was downtime.



Bandwidth Problem

The last thing I did to the production system before the weekend was move Lucidchart's web servers into the VPC. These web servers are located in two of the four AZs in us-east. Remember from my VPC setup that each AZ has a single NAT instance of size t1.micro. This means that all traffic between the web servers and the MySQL database, MongoDB database, cache servers, and internal services was being funneled through two t1.micro instances.

Moving the web servers into VPC didn't immediately trigger any errors. The bandwidth was sustainable until a couple of rather obnoxious queries hit the MySQL server, queries that returned mountains of data. To help things along, our JavaScript client, after timing out, retried the same query. As soon as the traffic exceeded the t1.micro's capacity, we had issues.


At 08:00 UTC in the graph above, the network traffic through one of our NATs jumped from 5 MBps to 13 MBps. We have had higher spikes, but the sustained throughput over the next hour and a half wreaked havoc for all of our users. Eventually, everything calmed down, and we got to the fix.

Bandwidth Solution

It only took one more bout of downtime to force an upgrade of the t1.micro to an m1.large; the difference in bandwidth was enough to handle any other issues. Keep in mind that this all happened over a weekend, so I was reluctant to finish the migration in its entirety, which also would have solved the problem. The NAT instances are EBS-backed, so upgrading was fairly simple.

If you followed my VPC setup and are running into this issue, here are the steps for upgrading your t1.micro to an m1.large. The upgrade does require downtime, but only in one availability zone at a time.
  1. Start a new instance, size m1.large, in the same AZ as the t1.micro NAT.
  2. Stop the m1.large NAT instance.
  3. Remove all instances, in the same AZ, from ELBs or other load balancers.
  4. Stop the t1.micro NAT.
  5. Detach and delete the root EBS volume from the m1.large NAT.
  6. Detach the root EBS volume from the t1.micro NAT, and attach it to the m1.large NAT.
  7. Terminate (delete) the t1.micro NAT.
  8. Disable the Source/Dest check on the m1.large NAT.
  9. Reassign the elastic IP from the t1.micro NAT to the m1.large NAT.
  10. Start the m1.large NAT.
  11. Update the affected subnet routing table's default entry to go to the m1.large NAT.
  12. Add all instances, in the same AZ, to ELBs or other load balancers.
  13. Repeat all steps for other AZs.
Switching from t1.micro to m1.large should, at the very least, hold your system over until you can finish the migration to VPC - the real solution.
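
For anyone who would rather script the swap than click through the console, here is a minimal sketch of steps 2 and 4 through 11 as an ordered list of EC2 API calls. It is not the tooling I used over the weekend. It assumes a boto3-style EC2 client, that the m1.large from step 1 already exists, and that the root device is /dev/sda1; the function names and all IDs are hypothetical. The ELB steps (3 and 12) are left out.

```python
from typing import Any, Dict, List, Tuple

# One step is (EC2 client method name, keyword arguments).
# A name starting with "wait:" means "block on that waiter" instead.
Step = Tuple[str, Dict[str, Any]]


def build_nat_swap_plan(old_nat: str, new_nat: str,
                        old_root_vol: str, new_root_vol: str,
                        allocation_id: str, route_table_id: str,
                        root_device: str = "/dev/sda1") -> List[Step]:
    """Return the API calls for one AZ, in the order of the steps above.

    Assumption: root_device matches the root device name of the NAT AMI.
    """
    return [
        # Steps 2 and 4: stop both NAT instances.
        ("stop_instances", {"InstanceIds": [new_nat]}),
        ("wait:instance_stopped", {"InstanceIds": [new_nat]}),
        ("stop_instances", {"InstanceIds": [old_nat]}),
        ("wait:instance_stopped", {"InstanceIds": [old_nat]}),
        # Step 5: detach and delete the m1.large's own root volume.
        ("detach_volume", {"VolumeId": new_root_vol, "InstanceId": new_nat}),
        ("wait:volume_available", {"VolumeIds": [new_root_vol]}),
        ("delete_volume", {"VolumeId": new_root_vol}),
        # Step 6: move the t1.micro's root volume over to the m1.large.
        ("detach_volume", {"VolumeId": old_root_vol, "InstanceId": old_nat}),
        ("wait:volume_available", {"VolumeIds": [old_root_vol]}),
        ("attach_volume", {"VolumeId": old_root_vol, "InstanceId": new_nat,
                           "Device": root_device}),
        ("wait:volume_in_use", {"VolumeIds": [old_root_vol]}),
        # Step 7: the t1.micro is no longer needed.
        ("terminate_instances", {"InstanceIds": [old_nat]}),
        # Step 8: a NAT must forward packets not addressed to itself.
        ("modify_instance_attribute", {"InstanceId": new_nat,
                                       "SourceDestCheck": {"Value": False}}),
        # Step 9: move the elastic IP.
        ("associate_address", {"AllocationId": allocation_id,
                               "InstanceId": new_nat,
                               "AllowReassociation": True}),
        # Step 10: bring the new NAT up.
        ("start_instances", {"InstanceIds": [new_nat]}),
        ("wait:instance_running", {"InstanceIds": [new_nat]}),
        # Step 11: point the subnet's default route at the new NAT.
        ("replace_route", {"RouteTableId": route_table_id,
                           "DestinationCidrBlock": "0.0.0.0/0",
                           "InstanceId": new_nat}),
    ]


def run_plan(ec2: Any, plan: List[Step]) -> None:
    """Execute the plan against an EC2 client, stopping at the first error."""
    for name, kwargs in plan:
        if name.startswith("wait:"):
            ec2.get_waiter(name[len("wait:"):]).wait(**kwargs)
        else:
            getattr(ec2, name)(**kwargs)
```

With boto3 installed and credentials configured, run_plan(boto3.client("ec2"), plan) would carry out the calls. Building the plan as plain data first makes it easy to print and review before anything touches production.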

Mitigation

While this problem was unfortunate and unexpected, it was certainly preventable. In hindsight, I can think of three separate ways that this issue could have been mitigated.
  1. Do the entire transition in a single X-hour period. This method does not guarantee a clean migration, as anything can go wrong in those X hours. It also may not be feasible in every scenario.
  2. Begin with a larger NAT instance. Whether it's an m1.large or a hi1.4xlarge, plan ahead for the maximum bandwidth you will need, double it, and use the instance size that accommodates it. This beefy NAT size is only temporary. After the migration to VPC is over, the NAT can be replaced with a smaller instance.
  3. Perform two separate migrations. The first migration would be to a public subnet, where every one of the instances has its own elastic IP address. The second migration would be to a private subnet, but only for those instances which belong in a private subnet. Not only is this tedious, it's twice as error-prone and has issues with auto-scaled instances.
Of the techniques listed above, I would choose the second. It guarantees no issues while minimizing errors. Whatever your flavor of mitigation, at least be aware of the bandwidth bottleneck that lurks in a migration to VPC.
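
The sizing rule from the second option (take the peak, double it, pick the instance that fits) can be sketched in a few lines. The capacity numbers below are placeholders I made up for illustration, not published AWS figures; measure what each instance type actually sustains in your own setup before trusting any table like this. The function name is hypothetical as well.

```python
from typing import Dict, Optional


def pick_nat_size(peak_mbps: float,
                  capacities: Dict[str, float],
                  headroom: float = 2.0) -> Optional[str]:
    """Return the smallest instance type that covers peak * headroom.

    capacities maps instance type -> sustained throughput you measured,
    in the same unit as peak_mbps. Returns None if nothing is big enough.
    """
    required = peak_mbps * headroom
    fitting = [(cap, name) for name, cap in capacities.items() if cap >= required]
    return min(fitting)[1] if fitting else None


# Placeholder capacities in MBps: assumptions for illustration only.
measured = {"t1.micro": 10, "m1.large": 60, "hi1.4xlarge": 1000}

# The 13 MBps sustained load from the outage, doubled, needs 26 MBps.
print(pick_nat_size(13, measured))  # m1.large
```

The default headroom of 2.0 is the "double it" from the list above; raise it if your traffic is spikier than ours.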

4 comments:

  1. Michael, is there any tool or service that could provide the following high-availability services? a) Allow VPCs to be networked and monitored via a dashboard; if a tunnel between two or more VPCs goes down, the tool auto-detects it, activates a standby tunnel to ensure uptime, and alerts the admin. b) Keep the NAT highly available by setting up a standby NAT instance and monitoring it
    2) as well setup a HA NAT
