Architected Availability: Zabbix vs Graphite

Friday, August 23, 2013

Zabbix vs Graphite

Monitoring solutions have been around for some time, but I still haven't found the perfect one. I first implemented Zabbix for Lucidchart in late 2011, and, just a few months ago, I installed Graphite. I'd like to take you through my decision process so you can find the right monitoring tool for your needs.

Comparing these two products is not easy, because they were designed to do different things. Zabbix was meant to be a server monitoring solution, while Graphite is more of a data collection and reporting tool. What I'd really like to see is a merger of the two tools, but that probably won't happen anytime soon.

Zabbix

After a blind comparison, I initially chose Zabbix over Cacti, Nagios, and Ganglia. Zabbix was free, good at alerting, and supposedly scalable.

Zabbix does a number of things really well.

It alerts extremely well. The trigger and alert configuration is easy to understand and, as long as the server isn't overloaded, works consistently.
It is easy to set up. I'm very familiar with PHP, Apache, and MySQL, which are the underlying technologies. The process goes something like this: put the files in the right place, add an Apache virtual host, hit the page, and enter your database credentials.
It allows a bunch of graphs to be displayed on the same screen, making nice dashboards and debugging tools. Once configured, it works like a charm.

There are a number of things that Zabbix does spectacularly horribly, however.

The number and types of queries that it runs through your database are just awful. Every 6 months or so, we had to truncate our history_uint table because it had gotten too big to operate in an efficient manner. The bigger the database gets, the worse it performs. I'm not talking big, either. We had an m1.large with 46GB of data the first time we had to drop tables. I was not thrilled.
The UI is probably one of the worst UIs I have ever seen. It's not intuitive at all. At times, you have to hit 'Save' on multiple screens to get the data to actually save. The terminology bombards your senses and takes hours to figure out.
Zabbix only monitors servers. It can't monitor applications, unless it's one application per server. The workaround is to create a fake server that you report statistics to. The problem is that it still thinks that data resides on a server! My application data is server-agnostic, and should be treated as such.

Graphite

I came across Graphite while speaking with some presenters at AWS re:Invent in 2012. It turns out that Graphite is used widely by many companies to store and retrieve metrics for servers and applications. After installing Graphite and feeding data to it for a few weeks, I found a number of issues with our production system and was able to examine them more closely.

Here are the primary reasons I'm pleased with Graphite.

It scales horizontally. I can add more servers to it, migrate the data, and it continues to work.
The data can be displayed in a variety of ways. Graphite has many functions that can be applied to the data, from averaging and summing to derivatives, integrals, and deviations.
The graph API allows me to make my own graphs on the fly, while encapsulating multiple servers, applications, or data points.
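
As a rough illustration of the functions and graph API mentioned above, here is a minimal sketch that builds a render URL combining several series. The host name and metric paths are hypothetical; only the /render endpoint, its target, from, and format parameters, and the sumSeries and derivative functions come from Graphite itself.

```python
from urllib.parse import urlencode

GRAPHITE_URL = "http://graphite.example.com"  # hypothetical host


def render_url(targets, time_from="-1h", fmt="png", base=GRAPHITE_URL):
    """Build a Graphite render API URL for one or more target expressions."""
    # Each target becomes its own "target" query parameter.
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("format", fmt)]
    return base + "/render?" + urlencode(params)


# Hypothetical metric paths; the wildcard covers every matching server.
url = render_url([
    "sumSeries(servers.*.requests.count)",
    "derivative(app.documents.created)",
])
print(url)
```

Because the graph is just a URL, the same expression can be dropped into a dashboard, a wiki page, or a chat message without any server-side configuration.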

It's not a silver bullet though. Graphite is missing a few key features essential to our monitoring ecosystem at Lucidchart.

It can't alert. I've found seyren, which is still in beta, and graphite-tattle, which lacks alerting sophistication. Most Graphite users I've spoken with feed the Graphite data into some other system that specializes in alerting. These same people are dissatisfied with the process.
It doesn't offer self-serve dashboards. There are plenty of dashboards within the Graphite community, but none that fit my needs. I need one that allows users to create their own dashboards and share them. I don't want dashboards created through client-side configuration; I want them database-backed.
Graphite doesn't have a data collection counterpart. StatsD is running on each box, but there is nothing that will collect the CPU, memory, IO, or any of the server metrics we want to gather.
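
To give a sense of what a collection counterpart would involve, here is a minimal sketch of a collector that reads the system load averages and formats them for Carbon's plaintext listener (one "path value timestamp" line per metric, port 2003 by default). The metric naming scheme and the host name are assumptions made for illustration, and os.getloadavg() is only available on Unix-like systems.

```python
import os
import socket
import time


def carbon_lines(hostname, now=None):
    """Format the 1/5/15-minute load averages as Carbon plaintext lines."""
    now = int(now if now is not None else time.time())
    load1, load5, load15 = os.getloadavg()  # Unix only
    # Dots separate path components in Graphite, so flatten the hostname.
    prefix = "servers.%s.load" % hostname.replace(".", "_")
    samples = {"1min": load1, "5min": load5, "15min": load15}
    return ["%s.%s %.2f %d" % (prefix, k, v, now) for k, v in samples.items()]


def send_to_carbon(lines, host, port=2003):
    """Send newline-terminated metric lines to a Carbon listener over TCP."""
    payload = "\n".join(lines) + "\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


lines = carbon_lines("web1.example.com", now=1377216000)
for line in lines:
    print(line)
# send_to_carbon(lines, "graphite.example.com")  # hypothetical host
```

Run from cron or a small loop, something like this covers one metric; the real work is writing and maintaining an equivalent for memory, IO, and every custom metric.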

Why We're Moving to Graphite

We need an application that can scale horizontally, track custom metrics for servers and applications, alert, and show graphs in dashboards. Since neither of these products/communities does all of these things, we have to lower our expectations, manage multiple products, or create our own solution.

It will be easier to build alerting and dashboards on top of Graphite than it will be to add horizontal scaling to Zabbix. Graphite's data APIs make it very simple to extract data, which makes it possible for a separate product to handle alerting and dashboards. Zabbix, on the other hand, would need core modifications to become truly scalable.
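
To show why extracting data makes external alerting feasible, here is a minimal sketch of a threshold check over the JSON that the render API returns with format=json (a list of series, each with a target name and [value, timestamp] datapoints). The breaches function, the metric names, and the threshold are hypothetical, and the response is canned so the example runs without a Graphite server.

```python
import json


def breaches(render_json, threshold):
    """Return {target: latest_value} for series whose newest non-null
    datapoint exceeds the threshold."""
    alerts = {}
    for series in json.loads(render_json):
        # Graphite reports missing samples as null, so skip them.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values and values[-1] > threshold:
            alerts[series["target"]] = values[-1]
    return alerts


# Canned response in the shape returned by /render?format=json
sample = json.dumps([
    {"target": "servers.web1.load.1min",
     "datapoints": [[0.4, 1377216000], [5.2, 1377216060], [None, 1377216120]]},
    {"target": "servers.web2.load.1min",
     "datapoints": [[0.3, 1377216000], [0.5, 1377216060]]},
])
alerts = breaches(sample, threshold=4.0)
print(alerts)
```

A real alerting product would also need state, escalation, and notification, which is where most of the effort would go.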

The plan is to migrate completely to Graphite, and then write our own alert and dashboard product. Watch for the open source project!

12 comments:

Anonymous, September 14, 2013 at 5:56 AM
About:
"""
there is nothing that will collect the CPU, memory, IO, or any of the server metrics we want to gather.
"""

Did you try collectd?
Matthew Barlocker, September 15, 2013 at 8:48 AM
I was aware of it, but I didn't try it. We have a lot of custom metrics that collectd doesn't have prebuilt modules for. That was a good catch, though. The sentence should read

"""
there is nothing that will collect the CPU, memory, IO, and all of the other server metrics we want to gather.
"""
Anonymous, October 10, 2013 at 4:56 AM
Graphite "It doesn't offer self-serve dashboards. "

It does. The default graphite-web frontend provides customizable dashboards which can be shared, and are saved on the server-side (in my case, in an sqlite DB)
Unknown, January 28, 2014 at 1:02 PM
Hi,

Are you still using Graphite, and what else do you use to provide dashboards and such?
Matthew Barlocker, January 28, 2014 at 9:34 PM
Definitely still using Graphite. Despite the lack of good dashboards and alerting, I find the tool invaluable for tracking, storing, and retrieving time series data.

Though I haven't done a post about it yet (because I'm strapped for time lately), I've started an open source project called "Nark", available at https://github.com/lucidchart/nark, that does the dashboards and alerting for me. Let me know if you have any problems installing, configuring, or using it.
Monitoring dude, January 29, 2014 at 2:48 PM
Have you tried Pandora FMS? It takes the same approach as Zabbix, but it is more scalable and the GUI is another story. Even if you don't want the monitoring (alerts, thresholds, visual screens, graphs), you can easily transform the raw collected data and send it to Graphite; its E/R model is pretty straightforward.

http://pandorafms.com
Anonymous, January 29, 2014 at 2:49 PM
I forgot to mention that its agents are, by a wide margin, the most powerful agents you can find in any open source monitoring project.
Matthew Barlocker, January 29, 2014 at 8:04 PM
I took a look at it, and I'm not too impressed. It seems to specialize in monitoring VMware servers. I don't have any of those.
Unknown, April 24, 2014 at 2:34 PM
Regarding dashboards, go to http://grafana.org. It's the dashboard you've been waiting for (and note that if you don't have a handy elasticsearch cluster, you can still use it regardless; it has gist support).