Architected Availability

Monday, January 30, 2017

Blue Matador

A quick post to introduce the software monitoring company I've been building over the last year. It's called Blue Matador, Inc. When your website or application goes down, Blue Matador helps you understand what broke, why it broke, and how to fix it quickly.

Our first product is Lumberjack: a centralized log management system that gives you better insights, faster. It takes 10 minutes to set up, and crunches your log data within 2 seconds of real time.

I've spent years without vacations, days without rest, and nights without sleep - all because I was on call and fixing problems. Sound like you? From one sysadmin to another, it's not worth it. Get better tooling, solve your problems faster, and fix root causes.

If you mention this post, I'll give you a discount. Just ask for me (Matthew, the CEO).

Tuesday, August 5, 2014

Play! 2.3 Template Improvements

My thanks go out to Grant Klopper from The Guardian. Last week, he validated a change I made to the Play! Framework back in December. After upgrading from Play! 2.2 to 2.3, Grant noticed dramatic changes in response times and memory usage, both for the better. The changes were so awesome that Grant was able to shut down 2/3 of the servers running that app. A conversation on Twitter followed with James Roper, a lead developer at Typesafe, in which the root cause of the improvements was identified: my template fixes. Typesafe then published a blog post about the story, thanking the contributors and soliciting further enhancements to their code base.

Figure: Response time of The Guardian website during the upgrade to Play 2.3. Copied from Grant's tweet about response times, mentioned above.

Without bragging about my awesomeness, I wanted to explain how I found and fixed the issue and then share some benchmarks of my own. Perhaps in this way, I can also solicit enhancements to the Play! framework, which has been a tremendous help in building state-of-the-art applications on the web.

Getting the Most From AWS Ephemeral Volumes

I mentioned in How to Use LVM and LUKS with EBS Volumes that I used LVM and LUKS for the ephemeral volumes at Lucid Software. I have made the associated scripts public on GitHub under an Apache 2.0 license. I'll take just a minute to describe what the script is and how we use it at Lucid.

How to Use LVM and LUKS with EBS Volumes

A while back, I posted my findings on encryption at rest using LUKS. Circling back, here's the procedure I used. Although I was operating on Ubuntu 12.04 and EBS volumes, the same procedure can be used in many different scenarios.

Cloud Connect 2013

Another month, another conference. I am currently in Chicago, IL, getting ready to present at Cloud Connect Chicago. I'll be presenting Case Study: Lucidchart's Migration to VPC.

Zabbix vs Graphite

Monitoring solutions have been around for some time, but I still haven't found the perfect one. I first implemented Zabbix for Lucidchart in late 2011, and, just a few months ago, I installed Graphite. I'd like to take you through my decision process so you can find the right monitoring tool for your needs.

Comparing these two products is not easy, because they were designed to do different things. Zabbix was meant to be a server monitoring solution, while Graphite is more of a data collection and reporting tool. What I'd really like to see is a merger of the two tools, but that probably won't happen anytime soon.
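
To make the "data collection" side of that distinction concrete, here is a minimal sketch of pushing one data point to Graphite. It assumes a Carbon listener accepting the plaintext protocol on its default port, 2003; the hostname and metric path are hypothetical placeholders, and this is an illustration rather than the setup used at Lucidchart.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    # Graphite's plaintext protocol is one line per data point:
    # "<metric path> <value> <unix timestamp>" terminated by a newline.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.com", port=2003, timestamp=None):
    # host is a hypothetical placeholder; 2003 is Carbon's default plaintext port.
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Print the line instead of sending it, so the sketch runs without a server.
    print(format_metric("servers.web01.load", 0.5), end="")
```

Nothing has to be declared up front: Graphite stores whatever metric path it is sent, which is part of why it feels like a collection and reporting tool rather than a monitoring system with predefined hosts, items, and triggers.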

PDF Service Memory Leaks

One of the most attractive features of Lucidchart is the direct mapping of pixels from screen to page. An essential part of this process is our PDF generator. JSON render data goes in and a PDF or an image comes out. Though it sounds simple, it contains 13k lines of Scala code, heavily uses Akka actors to gather and render fonts, images, and pages, depends on 8 internally maintained jars and 83 others, and is responsible for generating 50k PDFs and images a day (1.5M per month, 18.25M per year) at its current load. This is anything but a simple service.

Keeping this service running smoothly is a high priority. On July 8, a code release to the Lucidchart editor uncovered several issues with the PDF service. More specifically, the new image manager allowed users to retrieve images from Facebook, Flickr, and Dropbox. With these changes, our robust system fell on its face. PDF JVMs were crashing hundreds of times a day, causing those servers to be terminated and replaced with new ones. It wreaked havoc on our users and our uptime.