Friday, August 23, 2013

Zabbix vs Graphite

Monitoring solutions have been around for some time, but I still haven't found the perfect one. I first implemented Zabbix for Lucidchart in late 2011, and, just a few months ago, I installed Graphite. I'd like to take you through my decision process so you can find the right monitoring tool for your needs.

Comparing these two products is not easy, because they were designed to do different things. Zabbix was meant to be a server monitoring solution, while Graphite is more of a data collection and reporting tool. What I'd really like to see is a merger of the two tools, but that probably won't happen anytime soon.

Zabbix

After a blind comparison, I initially chose Zabbix over Cacti, Nagios, and Ganglia. Zabbix was free, good at alerting, and supposedly scalable. 

Zabbix does a number of things really well.
  1. It alerts extremely well. The trigger and alert configuration is easy to understand and, as long as the server isn't overloaded, works consistently.
  2. It is easy to set up. I'm very familiar with PHP, Apache, and MySQL, which are the underlying technologies. The process goes something like this: put the files in the right place, add an Apache virtual host, hit the page, and enter your database credentials.
  3. It allows a bunch of graphs to be displayed on the same screen, making nice dashboards and debugging tools. Once configured, it works like a charm.
There are a number of things that Zabbix does spectacularly horribly, however.
  1. The number and types of queries that it runs through your database are just awful. Every 6 months or so, we had to truncate our history_uint table because it had gotten too big to operate in an efficient manner. The bigger the database gets, the worse it performs. I'm not talking big, either. We had an m1.large with 46GB of data the first time we had to drop tables. I was not thrilled.
  2. The UI is probably one of the worst UIs I have ever seen. It's not intuitive at all. At times, you have to hit 'Save' on multiple screens to get the data to actually save. The terminology bombards your senses and takes hours to figure out.
  3. Zabbix only monitors servers. It can't monitor applications, unless it's one application per server. The workaround is to create a fake server that you report statistics to. The problem is that it still thinks that data resides on a server! My application data is server-agnostic, and should be treated as such.

Graphite

I came across Graphite while speaking with some presenters at AWS re:Invent in 2012. It turns out that Graphite is used widely by many companies to store and retrieve metrics for servers and applications. After installing Graphite and feeding data to it for a few weeks, I found a number of issues with our production system and was able to examine them more closely.

Here are the primary reasons I'm pleased with Graphite.
  1. It scales horizontally. I can add more servers to it, migrate the data, and it continues to work.
  2. The data can be displayed in a variety of ways. Graphite has many functions that can be applied to the data, from averaging and summing to derivatives, integrals, and deviations.
  3. The graph API allows me to make my own graphs on the fly, while encapsulating multiple servers, applications, or data points.
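To make the last two points concrete, here is a minimal sketch of building a request for Graphite's render API. The host name and metric paths are hypothetical; sumSeries and derivative are two of Graphite's built-in functions, and the same URL returns raw data instead of an image when format is set to json.

```python
from urllib.parse import urlencode

# Hypothetical host and metric names, used only for illustration.
GRAPHITE_URL = "http://graphite.example.com"


def render_url(targets, since="-1h", fmt="png"):
    """Build a Graphite render API URL for one or more target expressions."""
    params = [("target", t) for t in targets]
    params += [("from", since), ("format", fmt)]
    return f"{GRAPHITE_URL}/render?{urlencode(params)}"


# Sum one metric across every web server (the * is a wildcard).
total_requests = "sumSeries(servers.web*.requests.count)"
# Rate of change of a counter reported by the application.
signup_rate = "derivative(apps.editor.signups.total)"

print(render_url([total_requests, signup_rate]))
```

Because each target is just a string, one graph can mix servers, applications, and derived series without any server-side configuration.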
It's not a silver bullet though. Graphite is missing a few key features essential to our monitoring ecosystem at Lucidchart.
  1. It can't alert. I've found Seyren, which is still in beta, and graphite-tattle, which lacks alerting sophistication. Most Graphite users I've spoken with feed the Graphite data into some other system that specializes in alerting. These same people are dissatisfied with the process.
  2. It doesn't offer self-serve dashboards. There are plenty of dashboards within the Graphite community, but none that fit my needs. I need one that allows users to create their own dashboards and share them. I don't want dashboards built from client-side configuration; I want them database-backed.
  3. Graphite doesn't have a data collection counterpart. StatsD is running on each box, but there is nothing that will collect the CPU, memory, IO, or any of the server metrics we want to gather. 
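For a sense of what filling that gap by hand involves, here is a rough sketch that reads the system load averages and ships them to Graphite over carbon's plaintext protocol (one "path value timestamp" line per sample). The host name and the servers.<hostname>.load.* naming scheme are assumptions for illustration, not what we run at Lucidchart; port 2003 is carbon's default plaintext port.

```python
import os
import socket
import time

# Hypothetical carbon host; 2003 is carbon's default plaintext port.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003


def format_metric(path, value, timestamp):
    """Format one sample as a carbon plaintext line: '<path> <value> <timestamp>'."""
    return f"{path} {value} {int(timestamp)}\n"


def collect_load(hostname, now):
    """Read the 1, 5, and 15 minute load averages (Unix only)."""
    names = ("1min", "5min", "15min")
    return [
        format_metric(f"servers.{hostname}.load.{name}", value, now)
        for name, value in zip(names, os.getloadavg())
    ]


def send(lines):
    """Send already formatted lines to carbon over TCP."""
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Typical use from a cron job or a loop:
#   send(collect_load(socket.gethostname().split(".")[0], time.time()))
```

Memory and IO would each need their own collector function, which is exactly the work a dedicated collection agent would otherwise do.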

Why We're Moving to Graphite

We need an application that can scale horizontally, track custom metrics for servers and applications, alert, and show graphs in dashboards. Since neither of these products (or their communities) does all of these things, we have to lower our expectations, manage multiple products, or create our own solution.

It will be easier to add alerting and dashboards on top of Graphite than it will be to add horizontal scaling to Zabbix. Graphite's data APIs make it very simple to extract data, which makes it possible to build a separate product that handles alerting and dashboards. Zabbix, on the other hand, would need core modifications to become truly scalable.
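As a rough illustration of how little is needed to alert on Graphite data, the sketch below checks the JSON that the render API returns with format=json: a list of series, each with a target name and [value, timestamp] pairs, where a missing sample is null. The sample payload, metric name, and threshold are made up, and this is not the design of any real alerting product.

```python
import json

# Sample payload in the shape returned by the render API with format=json.
SAMPLE = json.loads("""
[{"target": "servers.web1.load.1min",
  "datapoints": [[0.8, 1377216000], [4.2, 1377216060], [null, 1377216120]]}]
""")


def latest_value(series):
    """Return the most recent non-null value in a series, or None if there is none."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def breaches(payload, threshold):
    """Return (target, value) for every series whose latest value exceeds threshold."""
    alerts = []
    for series in payload:
        value = latest_value(series)
        if value is not None and value > threshold:
            alerts.append((series["target"], value))
    return alerts


print(breaches(SAMPLE, threshold=4.0))
```

A real alerting product would add scheduling, notification, and flap suppression on top, but the data extraction itself stays this small.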

The plan is to migrate completely to Graphite, and then write our own alert and dashboard product. Watch for the open source project!

12 comments:

  1. About:
    """
    there is nothing that will collect the CPU, memory, IO, or any of the server metrics we want to gather.
    """

    Did you try collectd?

  2. I was aware of it, but I didn't try it. We have a lot of custom metrics that collectd doesn't have prebuilt modules for. That was a good catch, though. The sentence should read

    """
    there is nothing that will collect the CPU, memory, IO, and all of the other server metrics we want to gather.
    """

  3. Graphite "It doesn't offer self-serve dashboards. "

    It does. The default graphite-web frontend provides customizable dashboards which can be shared, and are saved on the server-side (in my case, in an sqlite DB)

    Replies
    1. My apologies. You are correct. Graphite does have dashboards - awful, horrible dashboards. I hate them. I only got about 5 minutes into them when I gave up because the interface is horrible. At Lucidchart, they're still set up, but nobody uses them.

    2. Well, at least graphite-web allows you to render the data as JSON, which you can feed into a JavaScript chart, for example. Works quite well in practice.

      For the past few months I've been struggling with the same metrics collection/alerting conundrum. I've also started work on a database-backed, multi-tenant dashboard system, to make up for graphite-web's flaws.

      The biggest challenge I've faced is finding a monitoring system (even if commercial) that is well suited to using Graphite as a data source and alerting from there. I have struggled to find a solution that is both well suited to this kind of external check, and also API-driven.

  4. Hi,

    Are you still using Graphite, and what else do you use to provide dashboards and such?

  5. Definitely still using Graphite. Despite the lack of good dashboards and alerting, I find the tool invaluable for tracking, storing, and retrieving time series data.

    Though I haven't done a post about it yet (because I'm strapped for time lately), I've started an open source project called "Nark", available at https://github.com/lucidchart/nark, that does the dashboards and alerting for me. Let me know if you have any problems installing, configuring, or using it.

  6. Have you tried Pandora FMS? It takes the same approach as Zabbix, but it is more scalable, and the GUI is another story. Even if you don't want the monitoring (alerts, thresholds, visual screens, graphs), you can easily transform the raw collected data and send it to Graphite; its E/R model is pretty straightforward.

    http://pandorafms.com

  7. I forgot to mention that its agents are, by a wide margin, the most powerful agents you can find in any open source monitoring project.

  8. I took a look at it, and I'm not too impressed. It seems to specialize in monitoring VMWare servers. I don't have any of those.

  9. Regarding dashboards, go to http://grafana.org. It's the dashboard you've been waiting for (and note that if you don't have a handy elasticsearch cluster, you can still use it regardless; it has gist support).

    Replies
    1. We also use Grafana, and are super pleased with it. Setting up Elasticsearch is really not complicated, and one of our resourceful devs set up Graphite events to be posted into Elasticsearch as well. Grafana can display these events as annotations on the graphs, and she then configured the events to filter on the same parameters as the metric namespaces available on the dashboard, so that only relevant events are annotated.
      This now lets us annotate the graphs with events such as exceptions or detected slow queries, in the context of the other metrics. When hovering over an annotation, you get the details of the event, such as the full exception stack trace or the actual slow-performing SQL query. Really useful stuff.
