Tuesday, August 5, 2014

Play! 2.3 Template Improvements

My thanks go out to Grant Klopper from The Guardian. Last week, he validated a change I made to the Play! Framework back in December. After upgrading from Play! 2.2 to 2.3, Grant noticed dramatic improvements in both response times and memory usage. The changes were so awesome that Grant was able to shut down two-thirds of the servers running that app. A conversation on Twitter followed with James Roper, a lead developer at Typesafe, in which the root cause of the improvements was discovered: my template fixes. Typesafe subsequently published a blog post about the story, thanking the contributors and soliciting further enhancements to their code base.

Response time of The Guardian website during upgrade to Play 2.3.
Copied from Grant's Tweet about response times mentioned above.

Without bragging about my awesomeness, I wanted to explain how I found and fixed the issue and then share some benchmarks of my own. Perhaps in this way, I can also solicit enhancements to the Play! framework, which has been a tremendous help in building state-of-the-art applications on the web.

Monday, November 4, 2013

Getting the Most From AWS Ephemeral Volumes

I mentioned in How to Use LVM and LUKS with EBS Volumes that I used LVM and LUKS for the ephemeral volumes at Lucid Software. I have made the associated scripts public on GitHub under the Apache 2.0 license. I'll take just a minute and describe what the script is and how we use it at Lucid.

Wednesday, October 23, 2013

How to Use LVM and LUKS with EBS Volumes

A while back, I posted my findings on encryption at rest using LUKS. Circling back, here's the procedure I used. Although I was operating on Ubuntu 12.04 and EBS volumes, this same procedure can be used in many different scenarios.

Cloud Connect 2013

Another month, another conference. I am currently in Chicago, IL, getting ready to present at Cloud Connect Chicago. I'll be presenting Case Study: Lucidchart's Migration to VPC.

Friday, August 23, 2013

Zabbix vs Graphite

Monitoring solutions have been around for some time, but I still haven't found the perfect one. I first implemented Zabbix for Lucidchart in late 2011, and, just a few months ago, I installed Graphite. I'd like to take you through my decision process so you can find the right monitoring tool for your needs.

Comparing these two products is not easy, because they were designed to do different things. Zabbix was meant to be a server monitoring solution, while Graphite is more of a data collection and reporting tool. What I'd really like to see is a merger of the two tools, but that probably won't happen anytime soon.

Friday, August 2, 2013

PDF Service Memory Leaks

One of the most attractive features of Lucidchart is the direct mapping of pixels from screen to page. An essential part of this process is our PDF generator. JSON render data goes in and a PDF or an image comes out. Though it sounds simple, it contains 13k lines of Scala code, heavily uses Akka actors to gather and render fonts, images, and pages, depends on 8 internally maintained jars and 83 others, and is responsible for generating 50k PDFs and images a day (1.5M per month, 18.25M per year) at its current load. This is anything but a simple service.

Keeping this service running smoothly is a high priority. On July 8, a code release to the Lucidchart editor uncovered several issues with the PDF service. More specifically, the new image manager allowed users to retrieve images from Facebook, Flickr, and Dropbox. With these changes, our robust system fell on its face. PDF JVMs were crashing hundreds of times a day, causing those servers to be terminated and replaced with new ones. It wreaked havoc on our users and our uptime.

Thursday, July 18, 2013

Memory Leak in Opscode Chef Daemon

At Lucidchart, we use Opscode's Chef to manage all of our servers, and have since early 2012. We've used the same version of Ruby and Chef the whole time - no upgrades, downgrades, or new modules. Out of nowhere, every one of our servers started running out of memory. It caused our production site to run slowly, servers to fail health checks, and the ops team to scramble.