The reasoning is simple. I don't want our services to be publicly accessible; however, our office needs access to those services. The services I'm talking about include git, chef, apt, jenkins, and more.
These services are not the only issue. Imagine a problem in production that requires manual debugging. I would have to tunnel through the NAT instance manually just to debug the problem server. When I'm having any issue in production, the last thing I want is an extra step.
VPC Supported VPN
At first, I tried to use the ipsec VPN that is offered by AWS. The cost is negligible at $36/month. The benefits include fault tolerance, automatic configuration, and hardware support for Cisco, Asa, and a bunch of other routers. Two hours later, I was no closer to having a working VPN.
Lucidchart is a small, growing startup. I came in one Saturday and setup the network, cat5 cables, wifi, internal DNS, DHCP, and OpenVPN. We don't have thousands of dollars to spend on fancy equipment and Cisco routers. I fashioned the custom router out of an old desktop nobody was using. I'm leading to the fact that none of the configurations AWS offered worked on my custom router.
Having spent the two hours in vain, I chose to setup my own VPN using a tool I was already familiar with - OpenVPN. Immediately, two different propositions arose in my mind. First, a single VPN connection between our office and the NAT instances (load balanced for fault tolerance). Second, four VPN connections, one between the office and each NAT.
I'll discuss both strategies in some detail, but it should be noted that I implemented and tested both, and both work great under normal circumstances. I'll compare both strategies during normal use, in an entire AZ outage, and in a network outage between AZs. These scenarios gave me my basis for choosing the strategy for Lucidchart.
The single "load balanced" connection in OpenVPN is not load balanced at all. Basically, on failure, the client will just reconnect to a different server. See the images and captions for failure details.
|Only one connection is active at a time. All traffic between the office and the VPC will go over this single connection.|
|During an entire AZ failure, if the VPN server is in the failed AZ, the office client will notice, and connect to another server. This fixes the connection entirely, except for instances that are caught in the downed AZ.|
|During a network outage between AZs, the client will not reconnect to any other server, so you will be unable to reach some of the AZs in your VPC.|
In comparison, the multiple connection strategy has one connection per AZ. Only packets for IPs found within the private and public subnets of an AZ will be routed to the NAT associated with that AZ.
|Each NAT handles traffic for its own AZ. The office router routes traffic to a specific NAT.|
|During an entire AZ failure, only the connection associated with that AZ goes down. All of the other connections stay up.|
|During a network outage between AZs, the VPN is unharmed, because there is no cross-AZ traffic.|
I hope it's clear to you why I chose the second strategy, 4 separate connections. There is a tiny amount more setup, but the difference is literally 5 minutes of work. One could argue that this way has more moving parts. One could also argue that a car has more moving parts than a unicycle; that doesn't mean you'd ride a unicycle 30 miles to work.
If either of these work for you, and you want to get started, I highly recommend reading the well-written documentation of how to use OpenVPN. That, and about 30 minutes of configuration, is all it took for me to get everything working.
Update: I've posted a tutorial, including configuration files, for OpenVPN on AWS VPC.