Keeping this service running smoothly is a high priority. On July 8, a code release to the Lucidchart editor uncovered several issues with the PDF service. More specifically, the new image manager allowed users to retrieve images from Facebook, Flickr, and Dropbox. With these changes, our robust system fell on its face. PDF JVMs were crashing hundreds of times a day, causing those servers to be terminated and replaced with new ones. It wreaked havoc on our users and our uptime.
Tracing the First Issue
As with any new feature, it takes a while for users to notice it, become familiar with it, and use it naturally. This was no different with the image integrations. For that reason, we saw no issues with the PDF service until July 9, and didn't see minute-by-minute crashes until a few days after that. It seemed that the service had spontaneously combusted with no release and no server change as a catalyst.
We've seen this sort of issue in the past, where a set of large accounts all gather together and decide to diagram on the same day and same hour. So we ignored it, hoping it would go away—but it never did.
We had a few wild goose chases. Some were pretty quick to narrow down, but others took a day or so. Here's a quick rundown of the suspects.
- We recently added a StatsD client to the PDF service to track response times, error counts, etc.
- A change to the PDF service that allowed creator, last modifier, and last modified timestamp to be included in documents.
- JVM heap size.
- A recent change to our image service that locally stores the images referenced on the Internet.
- An in-memory cache, bounded only by number of elements, not total size, to temporarily store images and fonts.
Ten days later, somebody pointed out that the image manager had been released. It would allow users to upload large images more frequently to their documents. After just minutes of searching for memory leaks in a production heap dump, we found our issue.
In one function, we use an ImageIO ImageReader to down-sample an overly large image to a more reasonable size before we attempt to parse it, manipulate it, and put it in the generated PDF. Neither I nor the employee who wrote this code were familiar enough with an ImageReader object to know that you must call dispose() on the object when finished using it. The documentation makes it clear what the function does, but is completely void of information about requiring the call.
So the fix was simple. In the function, we put reader.dispose(). We released it to production and solved the problems. To give you an idea of how big of a change it was, check out this graph.
|Each line is an individual cloud server's JVM Old heap usage. Notice that no line stays in existence very long. Once it uses all of its memory, the JVM crashes, the process dies, and the server is replaced.|
Can you tell where the fix went into place? I hope so, but if not, it was the afternoon (UTC) of July 19. As soon as the release was fixed, all of our servers stopped crashing and the service returned to normal. It was a success!
The Second Issue
The servers continued for about a week before one of them crashed again. The other two servers remained in place, but the memory usage indicated that another crash would happen within hours. Having already learned our lesson on diagnosing heap issues, we jumped straight to another heap dump.
Minutes later, we discovered that our problem had to do with File.deleteOnExit(). This one, sad to say, was completely my fault. The documentation even specifies that the file will be deleted when the JVM terminates. I botched it.
The way we create PDFs and images requires us to store temporary files to disk. In one function, we write the temporary files, call the callback, and then clean up the temporary files. We specify deleteOnExit in addition to the normal cleanup, just in case there is a race condition on JVM termination. I had assumed that if a file was cleaned up using File.delete(), the deleteOnExit would be nullified and no evil would ensue. I was wrong. What really happens is this: java.io.DeleteOnExitHook holds a set of files to clean up on JVM termination. That set grows to be 3GB, and then the JVM crashes.
The fix was to remove all calls to File.deleteOnExit. This leaves a race condition in the code for cleaning up temporary files, but it seems better than a memory leak, so it was committed and released to production. The resultant graph is below.
|Again, each line is an individual cloud server's JVM Old heap usage. This graph is a superset of the graph above.|
As you can see, after the first fix, memory continued to climb to the heap max, at which point one server died. The other two remained in limbo for a couple hours til we released the deleteOnExit fix. Since that time, all has been well.
This was a pretty valuable lesson for us. I'm a PHP programmer—one who's completely unfamiliar with the JVM. So let me share what we learned.
- If your issue deals with the JVM heap, inspect the heap. Duh! You can do this using jmap and jhat. Netbeans also allows you to open a binary heap dump and inspect it, which tends to be more useful with deep inspections.
- Use a good reporting tool. You may be able to tell that our Graphite data cuts off on the left side. We just started using Graphite to monitor JVMs, CPU, IO, etc. Prior to that, we were using Zabbix. Getting this same graph from Zabbix is impossible because Zabbix is so slow and doesn't really make this kind of thing easy. Getting the graph from Graphite took seconds and helped us verify that the issue had been resolved.
- Unexpected JVM crashes don't last days at a time. Our traffic is sporadic, but we shouldn't have waited so long to diagnose and fix these issues.
- Watch for memory leaks. Coming from PHP, I know how to minimize memory usage, but even if I don't, every request receives a new process. Unless the individual requests start getting OOM errors, I don't need to worry about it. Not so with the JVM.
- Use AWS autoscaling and load balancing for automatic healing. With the frequency that our servers were dying, the service could have easily been down for days. With no effort, our service continued to run for weeks.
Have you ever had a similar problem? If so, sound off in the comments below!