Thursday, July 18, 2013

Memory Leak in Opscode Chef Daemon

At Lucidchart, we use Opscode's Chef to manage all of our servers, and have been doing so since early 2012. We've used the same version of Ruby and Chef the whole time - no upgrades, downgrades, or new modules. Then, out of nowhere, every one of our servers started running out of memory. It caused our production site to run slow, servers to fail health checks, and the ops team to scramble.

A couple of quick ssh sessions later, we found that the chef-client daemon was consuming more than a gigabyte of resident memory on every machine that had been up longer than a few days. We restarted the daemon and let it run a few times. Every half hour, it would pick up an additional 40MB and never release it. At that rate, it takes only 25 runs (12.5 hours) to fill up 1GB of memory.

I found this very surprising, given that the chef-client daemon has only one job: every 30 minutes, give or take 20 seconds, spawn a process. That seems like it shouldn't ever take more than 5KB! I googled around a bit and found that other people had similar issues. So, we couldn't use the daemon anymore.

The Solution

Updating Chef or Ruby was out of the question: I didn't want the headache, and our servers were dying fast. I immediately restarted chef on all servers to buy us 12 hours.

The best solution we came up with was running chef on a cronjob instead of the daemon. It won't have the long-running memory leak, and it's easy to do. I spent the next hour or so implementing our solution. Here are a couple things that our solution had to take into account.
  1. splay. Chef is built to contact a master server once every X minutes (30 minutes for us). If every server in our infrastructure hit our chef master at the same time, I doubt that the master could handle the load. To avoid this, the chef daemon uses a splay: it offsets each actual chef run by some random number of seconds. Over time, this distributes load across the entire 30-minute window. This is a must for our implementation.
  2. race conditions. No two instances of chef should run at the same time. While chef supports parallel execution, the outcome depends on how the cookbooks and scripts are written. For that reason, we will not allow chef to run in parallel.
  3. server registration. At boot time, every one of our servers registers itself with the appropriate roles, attributes, and environment. This runs before any of our services are started (S01); however, our chef installation is going to install a cronjob that could run at the same time as the first boot. We don't want to interrupt the registration and firstboot process.
Here is the cronjob and script that we came up with.

13,43 * * * * /usr/sbin/chef-client-cron

#!/bin/bash 
PREVENT_CHEF_FILE=/var/lock/chef-firstboot
if [ -f "$PREVENT_CHEF_FILE" ]; then
  echo "Cannot start chef-client while chef-firstboot is running"
  exit 1
fi
 
# make sure the daemon isn't running anymore
echo "Stopping the chef service"
/etc/init.d/chef-client stop
killall chef-client
splay=$((RANDOM % 30))
# zero-pad the fractional part, so e.g. "3.5" doesn't mean 3.5 seconds
millis=$(printf '%03d' $((RANDOM % 1000)))
echo "Sleeping for $splay.$millis seconds to splay"
sleep "$splay.$millis"
 
echo "making sure chef-client is really dead"
# pgrep -x matches the process name exactly, so this script
# (chef-client-cron) doesn't match the pattern and kill itself
for pid in $(pgrep -x chef-client); do
  kill -9 "$pid"
done
 
/usr/bin/chef-client -L /var/log/chef/client.log --once
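As an aside, if kill -9 feels heavy-handed, the mutual-exclusion part could also be sketched with flock(1); the lock file path here is just illustrative, not what we actually deployed:

```shell
#!/bin/bash
# Alternative sketch: serialize chef-client runs with flock(1) instead of
# killing stragglers. If a previous run still holds the lock, skip this run.
LOCKFILE=/var/lock/chef-client-cron.lock

exec 200>"$LOCKFILE"          # open the lock file on fd 200
if ! flock -n 200; then       # try to take an exclusive lock, non-blocking
  echo "Another chef-client run is still in progress; skipping this run"
  exit 0
fi

# lock is held for the lifetime of this process and released on exit
/usr/bin/chef-client -L /var/log/chef/client.log --once
```

The trade-off is that a hung chef run blocks subsequent runs until someone intervenes, whereas the kill -9 approach guarantees forward progress.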
It handles our three issues in these ways.
  1. splay. You can see the <= 30 second splay built into the script. What you can't see is the selection of minutes in the cronjob. We compute a uniform hash of the hostname and mod it by 30 to get the minute offset for the first run, then add 30 minutes for the second run. This gives us enough of a splay that we haven't noticed any issues.
  2. race conditions. You'll notice that in the middle of the script, we kill all chef-client processes. This only becomes a problem if a chef run takes more than 30 minutes. Typically, a run takes about 15 seconds, so we're nowhere near the red zone.
  3. server registration. Our firstboot script creates the file /var/lock/chef-firstboot and cleans it up when it finishes. Our script only needs to check for the existence of that file and exit appropriately. Simple enough.
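To make the minute selection concrete, something like this would generate the cron line (cksum stands in here for whichever uniform hash you prefer; the exact hash isn't shown above):

```shell
#!/bin/bash
# Derive a stable per-host minute offset in [0, 30) from a hash of the
# hostname, then schedule a second run 30 minutes later.
hash=$(hostname | cksum | awk '{ print $1 }')
first=$((hash % 30))
second=$((first + 30))
echo "$first,$second * * * * /usr/sbin/chef-client-cron"
```

Because the hash is uniform across hostnames, the fleet's runs spread out evenly across each 30-minute window, and because it's deterministic, a given host always runs at the same minutes.
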

Conclusion

Don't use the chef-client in daemon mode. You can't trust it. Even if it works now, it may spontaneously break down at some point in the future. Cron is a much safer route that has been tried and tested.

Chef itself is good. We like using it for server administration. We chose it over puppet for its ability to query the status of the system. It is like every other piece of software, though, so know what bugs exist before you go gung-ho.

2 comments:

  1. Hi, opscode software developer here...

    We recently released chef-11.6.0 and version 3.0.0 of the application cookbook that address a lot of memory leak issues. Specifically:

    CHEF-3432: memory leak fixed in 11.4.2 and 10.24.2
    CHEF-4065: change daemonized chefs to use client_fork by default in 11.6.0
    COOK-3360: application cookbook leaks an entire run_context every run, fixed in version 3.0.0 of the cookbook.

    Both chef-11.6.0 and application-3.0.0 were released earlier today.

    I'm guessing that you hit COOK-3360. The change to CHEF-4065 is a defensive change to do most of the chef-client work in a temporary forked process, which is roughly equivalent to what you're doing by working out of cron, and would have also prevented COOK-3360 from being an issue.

    CHEF-11 (since 11.0.0) has also prevented concurrent runs by locking the pid file. There were some issues with that pid file leaking which have been addressed by changes to the mixlib-shellout helper library (MIXLIB-12) and by using the FD_CLOEXEC flag on the lock file (both released in 11.6.0).

    Also, the use of --once overrides chef's built-in splay. If you use "-s 0" without "--once" or any "-i" option you should get a random splay.

    Although, there's nothing wrong with running out of cron and being paranoid... The cron daemon is exceedingly reliable...

  2. One more thing is that if you can replicate this on 11.6.0 we'd love it if you opened up a bug against it.

    Hopefully the default with client_fork will mean that if it's still leaking you'll never see the effects of it, so running it (in development) with 'chef-client --no-fork -s 0 -i 0' would be useful. You should see an initial climb of maybe 25% in the size of the ruby process over the first few dozen iterations, and after a while ruby may see a bunch of heap fragmentation and decide to double its RAM use (this is just how the ruby GC behaves), but total RAM should stay around 100MB or below. If you see constant leaking and wind up with 1GB chef-clients on 11.6.0 please submit a bug report...

    (and not sure why google didn't log me in correctly on that last post)
