Architected Availability: Encryption at Rest using LUKS

At past companies, encryption at rest was done at the application layer. Only part of the data had to be encrypted, so code was inserted into the model that would encrypt the sensitive data before inserting into the database and decrypt after retrieval. This approach worked, and had no impact on the database - the hardest layer to scale.

At Lucidchart, we have failed to close large sales due to lack of encryption. Large companies want to make sure that their proprietary information is transmitted and stored using industry standard encryption. I took on the task to find a method of encryption that made the most sense for our use case, and had little overhead on our systems. After a lot of testing, benchmarking, and evaluating, I came to the conclusion that encrypting the disks on our database servers using LUKS was, and still is, the best solution.

LUKS vs Application-Level

Although I investigated four different options, the two main competitors were encrypting at the application level and using LUKS. What follows are the benefits and drawbacks of each.

Application level encryption:

requires the encryption keys be stored in the webserver code or be available through a private service,
can be migrated with 100% uptime with little effort,
does not affect the performance of database servers,
requires a code change for every field in the database that should be encrypted,
can be disastrous if a development key is used in production.

LUKS:

requires manual intervention at every boot or storing the encryption keys on the server,
can be migrated with 100% uptime with a little effort,
reduces disk IO by up to 40%,
requires no code changes or maintenance,
allows multiple keys, in case any become corrupt or are lost,
allows a key to be split into multiple parts for safe key distribution.
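
To make the key-handling points concrete, here is a minimal sketch of how extra keys and key files look with cryptsetup. The device name, mapping name, and key-file paths are placeholders I made up for illustration, and the `run` wrapper and function names are hypothetical helpers, not part of cryptsetup. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: device, mapping name, and key-file paths are placeholders.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add another key (read from a file) to a free key slot on an existing
# LUKS device. cryptsetup asks for an existing passphrase first.
add_backup_key() {
    device=$1
    new_key_file=$2
    run cryptsetup luksAddKey "$device" "$new_key_file"
}

# Print the LUKS header, including which key slots are in use.
show_key_slots() {
    run cryptsetup luksDump "$1"
}

# Unlock without a prompt by reading the key from a file on the server.
# This is the "store the keys on the server" trade-off from the list above.
open_with_key_file() {
    device=$1
    name=$2
    key_file=$3
    run cryptsetup luksOpen --key-file "$key_file" "$device" "$name"
}
```

For example, `DRY_RUN=1 add_backup_key /dev/xvdf /root/backup.key` prints the luksAddKey command it would run. Dropping DRY_RUN runs it for real and requires root.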

LUKS Benchmarks

Because encrypting in the application doesn't affect the database servers, I didn't benchmark it. In the worst case, it would marginally affect our latency. To counteract the latency issue, we could simply provision more webservers to reduce context switching between running processes. Provisioning database servers is much harder - we haven't completed the migration to 100% sharded data.

Let me preface the data by emphasizing how many times I ran these benchmarks. I had 16 volumes each for m1.small, m1.medium, and m1.large. In my first tests, I had assumed that every EBS volume was created equal. That is not the case. So, after about 3 hours trying to figure out why the results weren't making sense, I changed the test script to assume that no two EBS volumes were the same. To compensate, I compared an EBS volume only to itself, with and without LUKS. As I moved between the different AWS instance sizes, I moved the same EBS volumes to the new instance.

Additionally, every single data point that you see is an average of 5 different runs of the benchmark script. After adjusting the script to assume no two EBS volumes are created equal, there were no outliers in the data.
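
The arithmetic behind each data point is simple enough to sketch. This is not my original script. The function names and the sample numbers in the usage note are made up for illustration, and it only assumes a POSIX shell with awk.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Average a list of throughput samples (for example, MB/s from 5 runs).
average() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Percent of throughput lost on one volume when LUKS is enabled.
# Both numbers must come from the same EBS volume, and the first
# (the throughput without LUKS) must be non-zero.
percent_loss() {
    awk -v plain="$1" -v luks="$2" \
        'BEGIN { printf "%.1f\n", (plain - luks) / plain * 100 }'
}
```

With invented samples, `average 10 20 30` gives 20.0, and `percent_loss 50 30` gives 40.0, meaning that volume lost 40% of its throughput under LUKS.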

The first 2 graphs look strikingly similar until you notice that their scales are different. As I stated earlier, running LUKS will lower disk IO by 5-40% for a single EBS volume.

The next graph is formatted a little differently, even though it looks similar to the other ones. Instead of comparing 15 distinct EBS volumes to each other, I am comparing the number of EBS volumes inside of a striped LVM volume group. By striping the EBS volumes, I was able to recoup some of the IO lost to LUKS. In fact, the m1.large did extremely well on writes, even exceeding the write throughput of a single EBS volume without LUKS.

While LUKS definitely costs disk IO, striping with LVM recovers much of it. If you do choose LUKS, adding LVM to the mix seems to be the way to go.

To get these benchmarks, I used `dd if=/dev/zero of=/dev/xvdf1 bs=1M count=100` for writing and `dd if=/dev/xvdf1 of=/dev/null bs=1M count=100` for reading.
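
A rough sketch of repeating those dd runs follows. Copying from /dev/zero to the target exercises writes, and copying from the target to /dev/null exercises reads. The function names, the default of 5 runs, and the optional block count are my assumptions for illustration, not the original script, and `bs=1M` assumes GNU dd.

```shell
#!/bin/sh
# Sketch only: function names and defaults are hypothetical.
# WARNING: write_test overwrites its target. Use a scratch volume only.

# Write COUNT blocks of 1 MiB to the target and print dd's summary line.
write_test() {
    dd if=/dev/zero of="$1" bs=1M count="${2:-100}" 2>&1 | tail -n 1
}

# Read COUNT blocks of 1 MiB from the target and print dd's summary line.
read_test() {
    dd if="$1" of=/dev/null bs=1M count="${2:-100}" 2>&1 | tail -n 1
}

# Run one of the tests several times (default 5) against the same target,
# printing one summary line per run.
repeat_test() {
    test_fn=$1
    target=$2
    runs=${3:-5}
    count=${4:-100}
    i=1
    while [ "$i" -le "$runs" ]; do
        "$test_fn" "$target" "$count"
        i=$((i + 1))
    done
}
```

Something like `repeat_test write_test /dev/xvdf1` followed by `repeat_test read_test /dev/xvdf1` would then give five summary lines for each direction on that volume.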

LVM & LUKS Instructions

Because of all the different ways to set up LVM & LUKS, I highly suggest you read the documentation for both and work out which layout fits your systems. It took me about 4 hours between the two, with LUKS being the more obnoxious to figure out. I've also posted my long set of instructions for using LVM & LUKS.
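
As a starting point, here is one possible layout sketched as a script. It stripes the EBS volumes with LVM first and then puts LUKS on the resulting logical volume. That ordering, the ext4 filesystem, and every name below (volume group, logical volume, mapping, devices) are assumptions for illustration, not a record of our production setup. The `run` wrapper is a hypothetical helper: with DRY_RUN=1 it prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch only: names, layout order, and filesystem are assumptions.
# WARNING: without DRY_RUN=1 this destroys data on the given devices.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: setup_striped_luks VG LV MAPPER DEVICE...
setup_striped_luks() {
    vg=$1
    lv=$2
    mapper=$3
    shift 3
    stripes=$#    # one stripe per remaining argument (EBS volume)

    run pvcreate "$@"                       # mark volumes as LVM physical volumes
    run vgcreate "$vg" "$@"                 # group them into one volume group
    run lvcreate --stripes "$stripes" --extents 100%FREE --name "$lv" "$vg"
    run cryptsetup luksFormat "/dev/$vg/$lv"          # prompts for a passphrase
    run cryptsetup luksOpen "/dev/$vg/$lv" "$mapper"  # creates /dev/mapper/$mapper
    run mkfs.ext4 "/dev/mapper/$mapper"     # ext4 is an assumption
}
```

Running `DRY_RUN=1 setup_striped_luks vg_data lv_data data_crypt /dev/xvdf /dev/xvdg` prints the six commands for a two-volume stripe, which is a safe way to review them before running anything as root.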

Wednesday, June 19, 2013
