AWS Adventures, part 5 – monitoring infrastructure

I’m addicted to performance metrics on my server infrastructure. For years in our on-prem environment, we monitored thousands of data points using Ganglia, Nagios, and Zabbix. In our new AWS infrastructure, we thought we’d look for some more sophisticated options.

This article details our experiences rolling out a monitoring infrastructure using the following pieces of software:

  • netdata for real-time monitoring of individual hosts and for sending host-level performance metrics to the time series database
  • InfluxDB as our time-series database for archiving performance metrics
  • Grafana for customized dashboards for analyzing the time series data
  • Kapacitor for alerting based on the time series data
  • Telegraf for collection of data from AWS-managed services where we can’t run netdata (e.g., MySQL performance metrics from an Aurora stack)
  • custom software to collect CloudWatch metrics from specific tagged AWS resources, transform them, and ship them to InfluxDB

If you look at our implementation, you’ll see we’re using most of the TICK stack, but we felt that Grafana provides a more sophisticated interface than Chronograf, so we opted to use it instead.

Here’s a high-level view of our monitoring stack:

Alternatives

We really had hoped to use a packaged solution for our monitoring and alerting. We didn’t really want one more technology stack to maintain in our cloud infrastructure. So we explored some alternatives before going down the road of implementing our own.

We started out looking at CloudWatch for our reporting and alerting, but honestly, its UI is terrible. Metrics are hard to interpret. It can be difficult to tell at a glance if you’re looking at rates versus discrete event counts. It’s not very aesthetically pleasing when you compare it to something like Grafana. And adding custom metrics into it can quickly get expensive.

In addition, if you’re using CloudWatch alarms to scale ASGs up and down, having alarms in alarm state is normal (when your traffic is low, for example, you’ll have a scale-down alarm sitting in alarm state). Amazon gets a lot of things right with their platform, but using alarms like this is crazy. I’m not putting real alarms into that system when the dashboard has a red light turned on 24×7 even when traffic is normal.

Third-party options can be expensive, and can take a lot of work to set up properly. We took a look at Librato, and even though it was super easy to pull in CloudWatch numbers, I’m not sure they were being interpreted properly in the graphs. Units didn’t seem quite right.

The more we got into it, the more we started to think we would do this ourselves using open-source tools. We just needed to keep the overall cost manageable. I definitely wanted to do this for less than $200/month. I would prefer to do it for around $100/month, but that might not be realistic.

Our AWS architecture uses the classic two-AZ design. We wanted our metrics to be highly available. We did consider making the system single-AZ and just accepting that we would lose metrics if we lost an AZ, but the loss of an AZ would be a serious event, and you’d probably have a lot of questions about how your systems performed during such a dramatic change in conditions. I think it would be short-sighted to set yourself up to have no metrics during that kind of failure.

Storage

We built out two monitoring EC2 instances, each with a dedicated 120GB EBS volume for data storage. There are lots of options in the time series database space, but we opted to use InfluxDB because we liked its data tagging capabilities. We would have liked to use a true InfluxDB cluster, but clustering is an Enterprise feature, and we knew that would make it more difficult to meet our budgetary constraints.

So instead of a cluster, we use two independent InfluxDB instances, and we send all our data to both. This means that there is a chance that some data will make it to one InfluxDB instance and not the other, but in general, this hasn’t been a big problem for us, and we’re willing to accept this for the cost tradeoffs.
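For the pieces of the pipeline we wrote ourselves, the dual-write logic really is as simple as it sounds: write to both instances and tolerate one side failing. Here’s a minimal sketch of the idea in Python using the influxdb client library (the host addresses and database name are placeholders, not our real configuration):

```python
from influxdb import InfluxDBClient

# The two independent InfluxDB instances, one per AZ (placeholder addresses).
INFLUX_HOSTS = ["10.0.1.10", "10.0.2.10"]
clients = [InfluxDBClient(host=h, port=8086, database="metrics") for h in INFLUX_HOSTS]

def write_to_both(points):
    """Send the same points to every InfluxDB instance, tolerating a failure
    on one side; the two databases may drift slightly out of sync."""
    for host, client in zip(INFLUX_HOSTS, clients):
        try:
            client.write_points(points)
        except Exception as exc:
            print(f"write to {host} failed: {exc}")
```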

The monitoring instances sit in a multi-AZ auto-scaling group with a min/max of 2 instances. This ensures we have exactly one instance in each AZ. Each instance is assigned an elastic IP address so that our data collection software knows where to find the Influx databases.

The instances have to be smart enough in their initialization to use the proper elastic IP address and to mount the appropriate EBS volume. One volume is designated for each AZ, and the init scripts know how to mount the right one. We also have to make sure that if we replace the instances, we let the previous instance shut down completely before we bring up a new instance (we can’t have two hosts mounting the same EBS volume).
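For the curious, the core of that init logic might look something like the following sketch in Python with boto3. The per-AZ allocation and volume IDs, the region, and the device name are all placeholders; the real thing runs as a boot script on the instance.

```python
import boto3
import requests

# Per-AZ resources (placeholder IDs): one elastic IP and one data volume per AZ.
RESOURCES = {
    "us-east-1a": {"allocation_id": "eipalloc-aaaa1111", "volume_id": "vol-aaaa1111"},
    "us-east-1b": {"allocation_id": "eipalloc-bbbb2222", "volume_id": "vol-bbbb2222"},
}

# Ask the instance metadata service who we are and where we're running
# (IMDSv1 shown for brevity).
META = "http://169.254.169.254/latest/meta-data"
instance_id = requests.get(f"{META}/instance-id").text
az = requests.get(f"{META}/placement/availability-zone").text

ec2 = boto3.client("ec2", region_name=az[:-1])
res = RESOURCES[az]

# Claim this AZ's elastic IP so the data collectors can always find us.
ec2.associate_address(InstanceId=instance_id, AllocationId=res["allocation_id"])

# Attach this AZ's data volume. It must be fully detached from the previous
# instance first, which is why we wait for the old instance to shut down
# completely before bringing up a replacement.
ec2.get_waiter("volume_available").wait(VolumeIds=[res["volume_id"]])
ec2.attach_volume(InstanceId=instance_id, VolumeId=res["volume_id"], Device="/dev/xvdf")
```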

Another unique characteristic of InfluxDB is its concept of retention policies. Some time series databases automatically roll up your data into lower-resolution values over time, allowing you to keep data over a longer period of time without the huge storage requirements of storing high resolution data for that entire time window. InfluxDB doesn’t do this automatically. You have to define retention policies and then set up continuous queries to perform the averaging from one retention policy to the next. You have to do this explicitly for every measurement you put into InfluxDB. We wrote software to build these automatically for us as we send new measurements into InfluxDB. It’s nice to be able to just send arbitrary values to InfluxDB without a lot of new back-end configuration. Compare that to the work it took to bring new values into something like Zabbix, and this is a welcome change.
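To give a concrete idea of what that automation does, here’s a rough sketch in Python using the influxdb client. The retention policy names and continuous query naming are made up, but the durations match the retention scheme described in the Cost section below, and the rollup chain (raw → 1m → 15m) always uses mean() so the queries can be generated mechanically:

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

# Three retention policies matching our retention scheme:
# raw 10s data for 10 days, 1m rollups for 120 days, 15m rollups for 2 years.
client.query('CREATE RETENTION POLICY "raw_10d" ON "metrics" DURATION 10d REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "rollup_1m" ON "metrics" DURATION 120d REPLICATION 1')
client.query('CREATE RETENTION POLICY "rollup_15m" ON "metrics" DURATION 730d REPLICATION 1')

def ensure_continuous_queries(measurement):
    """Create the downsampling continuous queries the first time we see a
    measurement: raw -> 1m -> 15m, always averaging with mean()."""
    rollups = [("raw_10d", "rollup_1m", "1m"), ("rollup_1m", "rollup_15m", "15m")]
    for source_rp, dest_rp, interval in rollups:
        cq_name = f"cq_{measurement}_{interval}"
        # Note: mean(*) prefixes output field names with "mean_", which queries
        # against the rollup retention policies have to account for.
        client.query(
            f'CREATE CONTINUOUS QUERY "{cq_name}" ON "metrics" BEGIN '
            f'SELECT mean(*) INTO "metrics"."{dest_rp}"."{measurement}" '
            f'FROM "metrics"."{source_rp}"."{measurement}" '
            f'GROUP BY time({interval}), * END'
        )
```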

As an example, I could start capturing a new value like “Apache requests per second” (using any software that is capable of calculating this value), send it to InfluxDB with a measurement name (e.g. “apache.requests_per_second”), tag it with some host name information so I can investigate individual servers if I need to, and the data will be in InfluxDB, ready to chart and alert on. I didn’t have to explicitly define the value in InfluxDB. But the automatic generation of continuous queries is key; without doing that, I’d have to define a set of continuous queries every time I start to record a new measurement.
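Shipping that hypothetical Apache metric from our own code would then look something like this, building on the two sketches above (the tag and field names and the value are made up for illustration):

```python
point = {
    "measurement": "apache.requests_per_second",
    "tags": {"host": "web-01", "az": "us-east-1a"},
    "fields": {"value": 137.5},
}

# Make sure the downsampling continuous queries exist for this measurement,
# then ship the point to both InfluxDB instances.
ensure_continuous_queries("apache.requests_per_second")
write_to_both([point])
```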

Collecting

So the question was: what client software should we install on our instances to collect the data and send it to InfluxDB? There are lots of options like Telegraf, collectd, etc. We opted to take advantage of the streaming capabilities of netdata. Not only does netdata provide a fantastic real-time view into an individual server’s performance, it can stream out to another time series database. Not to mention, netdata is super lightweight, which is critical when you’re trying to cost-optimize your cloud environment. You don’t want your monitoring software to be the reason you have to upgrade to a larger instance size, costing you more money.

One copy of netdata can only send to one backend, though that may change in the future. Until then, we get around this by running two netdata processes on each host. The first gathers the metrics, provides the web-based UI, streams to one InfluxDB instance, and relays the data to the second netdata process, which streams only to the other InfluxDB instance. In this way, we can send the data to both of our InfluxDB instances.

Netdata is very thorough; it captures hundreds of performance metrics. I believe we were getting about 700-800 individual values with the default configuration. We looked at the cost of archiving all that data, and we decided we didn’t need a lot of it. With some careful configuration, we were able to trim this down to about 300 metrics.

Netdata gets us host-level data from the Linux operating system running on our instances. But there are other kinds of data that we needed to ingest and analyze.

Services like Aurora and ElastiCache don’t let us install our own client software on the hosts (nor do I want to!). So we needed something to connect to those services and retrieve performance metrics. Telegraf handles this out of the box. We had to do a lot of custom configuration to get just the set of MySQL metrics we wanted, but it wasn’t too bad.

In addition, there are lots of proprietary AWS services, like Elastic Load Balancing and CloudFront, that don’t expose standard metrics Telegraf can collect. For those, we wrote our own code to query CloudWatch and send the values to InfluxDB (a rough sketch of the collector follows the list below). There were a few things to watch out for:

  • variable latency – the delay on CloudWatch values is different for every metric. For some metrics, you might have numbers a few minutes behind real time. For others, it could be hours (EFS in particular). We had to build the software with some configurability to handle this extreme variability (if we only looked back 10 minutes, for example, we’d never get any EFS values)
  • partial reporting – some CloudWatch values come in with preliminary numbers and later get updates. The value for a given data point can change; the software had to be built to recognize these updated values and replace them in InfluxDB
  • rate conversion – many CloudWatch metrics are discrete counts (“Bytes Transferred” might mean “number of bytes transferred since the last reading”). This is not very helpful for people like me who think in megabits or gigabits per second. Also, because we are using auto-generated continuous queries in InfluxDB to reduce data resolution, we need consistency between measurements. If you’re reporting a data rate, you can average the values to reduce the resolution. But if you’re reporting absolute counts per interval, you can’t average; you would need a sum. We wanted all our continuous queries to perform the same averaging operation so that they could be auto-generated, so it was really important to look closely at each CloudWatch metric we gathered and determine whether it needed a rate conversion.
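Here’s a stripped-down sketch of the collector in Python with boto3, showing the configurable lookback window and a rate conversion, and reusing the write_to_both helper from the storage sketch earlier. The metric, dimensions, period, and lookback value are illustrations, not our exact configuration:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def collect_rate(namespace, metric, dimensions, lookback_minutes, period=300):
    """Fetch a CloudWatch byte-count metric over a configurable lookback
    window and convert it to bits per second before shipping to InfluxDB."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(minutes=lookback_minutes),
        EndTime=now,
        Period=period,
        Statistics=["Sum"],
    )
    points = []
    for dp in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
        # Rate conversion: bytes per period -> bits per second, so the values
        # can safely be averaged by the auto-generated continuous queries.
        bits_per_second = dp["Sum"] * 8 / period
        points.append({
            "measurement": f"cloudwatch.{namespace.split('/')[-1].lower()}.{metric.lower()}",
            "time": dp["Timestamp"],
            "fields": {"value": bits_per_second},
        })
    # Writing a point with the same measurement, tags, and timestamp overwrites
    # the earlier one, which is how late CloudWatch corrections get picked up.
    # write_to_both() is the dual-write helper from the earlier sketch.
    write_to_both(points)

# CloudFront is reasonably current; EFS reports hours behind real time and
# gets a much larger lookback in the real configuration.
collect_rate("AWS/CloudFront", "BytesDownloaded",
             [{"Name": "DistributionId", "Value": "EXAMPLE123ABC"},
              {"Name": "Region", "Value": "Global"}],
             lookback_minutes=60)
```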

Reporting/analysis

This is the really fun part of the whole thing. Seeing your systems described in beautiful interactive graphs really makes them come to life, and gives you tons of insight into what is happening on those systems, and what needs to be improved.

As mentioned before, netdata provides us a very nice real-time UI on port 19999 on each host. Graphs are updated once a second, and they are interactive.

For looking at longer-term trends, rolling up multiple hosts into one dashboard, or looking at data from sources other than netdata, we use Grafana. Its web UI runs on port 3000 by default. Its graphs are very interactive: you can focus on one particular chart, drag to zoom in on a specific time range, and toggle specific data series on and off. You can build highly sophisticated dashboards and have them cycle automatically.

Here’s a screenshot of our video streaming dashboard during the August 2017 total eclipse.

I should mention that we use a fantastic CLI tool, wizzy, to manage our dashboards. It imports/exports the dashboards in JSON format, perfect for committing to our git repo.

Alerting

We rely on alerting to identify problems in the infrastructure. If we’re pushing the CPU too hard on a particular instance, we may want to consider using a larger instance type or scaling horizontally to add more instances to handle the workload.

Given that we’re using InfluxDB, Kapacitor was a natural option for alerting. It streams values from InfluxDB as they are ingested, applying lambda expressions to do things like filter by tag and combine values mathematically. It can do anomaly detection, alerting when something deviates too much from the norm. Configuration is a little tricky, requiring you to author “TICK scripts”, but I think this complexity is necessary. Alerts in other tools, like Zabbix, can be quite complex to configure, and having the logic in a script lends itself better to version control than clicking through a web interface.

In addition to being able to send email, Kapacitor integrates nicely with a wide variety of notification services, like PagerDuty and VictorOps (the platform we’re using). It also has easy Slack integration, if you want to send a copy of all your notifications to a Slack channel for your team to monitor.

Cost

To understand our costs, you have to know a little about our environment and how much we’re collecting and storing:

  • monitoring about 40-50 EC2 instances with netdata, with about 300 unique measurements
  • netdata sends values to InfluxDB once every 10s
  • retain 10s data for 10 days
  • retain 1m data for 120 days
  • retain 15m data for 2 years
  • collecting CloudWatch metrics from about 10 AWS services, with multiple resources per service, and about 15 measurements per service
  • collecting performance metrics from one Elasticache cluster and two Aurora clusters (about 20-30 measurements each)

Costs:

  • two t2.large instances: $84.68/mo
  • ELB for the grafana UI: $18.76/mo
  • two 120GB EBS volumes for InfluxDB storage: $24.00/mo
  • additional CloudWatch metrics and API calls: $38.00/mo

Total cost: $165.44/mo. I am happy with this cost; I don’t think any packaged services could touch this in terms of data resolution, coverage, and retention. I hope this gives you some ideas on building your own monitoring stack in AWS!
