I’m addicted to performance metrics on my server infrastructure. For years in our on-prem environment,
we monitored thousands of data points using Ganglia, Nagios, and Zabbix. In our new AWS infrastructure,
we thought we’d look for some more sophisticated options.
This article details our experiences rolling out
a monitoring infrastructure using the following pieces of software:
- netdata for real-time monitoring of individual hosts and for sending host-level performance metrics to the time series database
- InfluxDB as our time-series database for archiving performance metrics
- Grafana for customized dashboards for analyzing the time series data
- Kapacitor for alerting based on the time series data
- Telegraf for collection of data from AWS-managed services where we can’t run netdata (e.g., MySQL performance metrics from an Aurora stack)
- custom software to collect CloudWatch metrics from specific tagged AWS resources, transform it, and ship it to InfluxDB
When you look at our implementation, we’re using most of the TICK stack, but we felt that Grafana provides a
more sophisticated interface than Chronograf, so we opted to use it instead.
Here’s a high-level view of our monitoring stack:
We really had hoped to use a packaged solution for our monitoring and alerting. We didn’t really want one more
technology stack to maintain in our cloud infrastructure. So we explored some alternatives before going down the
road of implementing our own.
We started out looking at CloudWatch for our reporting and alerting, but honestly, its UI is terrible. Metrics are
hard to interpret. It can be difficult to tell at a glance if you’re looking at rates versus discrete event counts.
It’s not very aesthetically pleasing when you compare it to something like Grafana. And adding custom metrics into it
can quickly get expensive.
Also, if you’re using CloudWatch alarms to scale ASGs up and down, having alarms in alarm state is normal (like when your
traffic is low, you’ll have a scale down alarm that will be in alarm state). Amazon gets a lot of things right with
their platform, but using alarms like this is crazy. I’m not putting real alarms into that system when the dashboard
has a red light turned on 24×7 even when traffic is normal.
Third party options can be expensive, and can take a lot of work to set up properly. We took a look at Librato, and
even though it was super easy to pull in CloudWatch numbers, I’m not sure they were being interpreted properly in
the graphs. Units didn’t seem quite right.
The more we got into it, the more we started to think we would do this ourselves using open-source tools. We just
needed to keep the overall cost manageable. I definitely wanted to do this for less than $200/month. I would prefer
to do it for around $100/month, but that might not be realistic.
Our AWS architecture uses the classic two-AZ design. We wanted our metrics to be highly available; we considered
making the system single-AZ, and just accepting the fact that we would lose metrics if we lost an AZ. But loss of
an AZ would be a serious event, and you’d probably have a lot of questions about how your systems performed during
such a dramatic change of conditions. I think it would be short-sighted to set yourself up to not have metrics
during this kind of failure.
We built out two monitoring EC2 instances, each with a dedicated 120GB EBS volume for data storage. There are
lots of options in the time series database space, but we opted to
use InfluxDB because we liked its data tagging capabilities. We would
have liked to use a true InfluxDB cluster, but that’s an Enterprise feature, and we knew that would make it
more difficult to meet our budgetary constraints.
So instead of a cluster, we use two independent InfluxDB instances, and we send all our data to both. This means
that there is a chance that some data will make it to one InfluxDB instance and not the other, but in general,
this hasn’t been a big problem for us, and we’re willing to accept this for the cost tradeoffs.
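A minimal sketch of the "send everything to both" idea, assuming the InfluxDB 1.x HTTP `/write` endpoint. The two addresses, the `metrics` database name, and the function names are placeholders made up for illustration, not our real configuration. Each destination is attempted independently, so one instance being down never blocks the write to the other (which is also exactly how the two copies can drift apart).

```python
import urllib.request
from typing import Callable, Dict, Iterable, Sequence

# Hypothetical endpoints: one InfluxDB per AZ (documentation-range IPs).
INFLUX_URLS = (
    "http://203.0.113.10:8086/write?db=metrics",
    "http://203.0.113.20:8086/write?db=metrics",
)


def http_post(url: str, body: bytes) -> None:
    """POST line-protocol data; urlopen raises on connection or HTTP errors."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=5):
        pass


def write_to_all(
    lines: Iterable[str],
    urls: Sequence[str] = INFLUX_URLS,
    post: Callable[[str, bytes], None] = http_post,
) -> Dict[str, bool]:
    """Send the same batch to every InfluxDB instance.

    Returns a {url: succeeded} map. A failure on one instance is recorded
    but does not stop the write to the others.
    """
    body = "\n".join(lines).encode("utf-8")
    results: Dict[str, bool] = {}
    for url in urls:
        try:
            post(url, body)
            results[url] = True
        except OSError:
            # URLError and HTTPError are both subclasses of OSError.
            results[url] = False
    return results
```

The `post` argument is injectable so the fan-out logic can be exercised without a live database.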
The monitoring instances sit in a multi-AZ auto-scaling group with a min/max of 2 instances. This ensures we have
exactly one instance in each AZ. Each instance is assigned an elastic IP address so that our data collection software
knows where to find the Influx databases.
The instances have to be smart enough in their initialization to use the proper elastic IP address and to mount the
appropriate EBS volume. One volume is designated for each AZ, and the init scripts know how to mount the right one.
We also have to make sure that if we replace the instances, we let the previous instance shut down completely before
we bring up a new instance (we can’t have two hosts mounting the same EBS volume).
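The lookup at the heart of those init scripts can be sketched like this. The AZ names and resource IDs are invented for illustration. The real work (reading the AZ from instance metadata, then associating the address and attaching the volume through the EC2 API) is deliberately left out.

```python
from typing import Dict, NamedTuple


class AzResources(NamedTuple):
    eip_allocation_id: str  # elastic IP reserved for this AZ
    volume_id: str          # EBS data volume that lives in this AZ


# Hypothetical IDs: exactly one elastic IP and one EBS volume per AZ.
RESOURCES_BY_AZ: Dict[str, AzResources] = {
    "us-east-1a": AzResources("eipalloc-aaaa1111", "vol-aaaa1111"),
    "us-east-1b": AzResources("eipalloc-bbbb2222", "vol-bbbb2222"),
}


def resources_for(az: str, table: Dict[str, AzResources] = RESOURCES_BY_AZ) -> AzResources:
    """Return the elastic IP and volume an instance booting in `az` should claim."""
    try:
        return table[az]
    except KeyError:
        # Fail loudly rather than guess: mounting the wrong volume is worse
        # than not coming up at all.
        raise ValueError(f"no monitoring resources defined for AZ {az!r}") from None
```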
Another unique characteristic of InfluxDB is its concept of retention policies. Some time series databases automatically
roll up your data into lower-resolution values over time, allowing you to keep data over a longer period of time without
the huge storage requirements of storing high resolution data for that entire time window. InfluxDB doesn’t do this
automatically. You have to define retention policies and then set up continuous queries to perform the averaging from
one retention policy to the next. You have to do this explicitly for every measurement you put into InfluxDB. We
wrote software to build these automatically for us as we send new measurements into InfluxDB. It’s nice to be able to
just send arbitrary values to InfluxDB without a lot of new back-end configuration. Compare that to the work it took
to bring new values into something like Zabbix, and this is a welcome change.
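To give a feel for what "building these automatically" involves, here is a rough sketch that generates InfluxQL (1.x syntax) for a chain of retention policies and the continuous queries between them. The tier durations match the ones listed later in this article (with 2 years written as 730d). The policy names, the `metrics` database, and the assumption that every measurement stores a single field called `value` are illustrative choices, not a description of our actual generator.

```python
from typing import List, NamedTuple, Sequence


class Tier(NamedTuple):
    name: str        # retention policy name (hypothetical naming scheme)
    resolution: str  # interval between points stored in this tier
    duration: str    # how long the tier keeps its data


TIERS = (
    Tier("rp_10s", "10s", "10d"),
    Tier("rp_1m", "1m", "120d"),
    Tier("rp_15m", "15m", "730d"),
)


def retention_policy_statements(db: str, tiers: Sequence[Tier] = TIERS) -> List[str]:
    """One CREATE RETENTION POLICY per tier; the finest tier is the default."""
    stmts = []
    for i, tier in enumerate(tiers):
        default = " DEFAULT" if i == 0 else ""
        stmts.append(
            f'CREATE RETENTION POLICY "{tier.name}" ON "{db}" '
            f"DURATION {tier.duration} REPLICATION 1{default}"
        )
    return stmts


def continuous_query_statements(
    db: str, measurement: str, tiers: Sequence[Tier] = TIERS
) -> List[str]:
    """One continuous query per adjacent pair of tiers, averaging into the coarser one."""
    stmts = []
    for src, dst in zip(tiers, tiers[1:]):
        stmts.append(
            f'CREATE CONTINUOUS QUERY "cq_{measurement}_{dst.name}" ON "{db}" BEGIN '
            f'SELECT mean("value") AS "value" '
            f'INTO "{db}"."{dst.name}"."{measurement}" '
            f'FROM "{db}"."{src.name}"."{measurement}" '
            f"GROUP BY time({dst.resolution}), * END"
        )
    return stmts
```

The `GROUP BY time(...), *` clause keeps every tag on the downsampled points, so per-host drill-down still works at the lower resolutions.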
As an example, I could start capturing a new value like “Apache requests per second” (using any software that is
capable of calculating this value), send it to InfluxDB with a measurement name (e.g. “apache.requests_per_second”), tag
it with some host name information so I can investigate individual servers if I need to, and the data will be in InfluxDB,
ready to chart and alert. I didn’t have to explicitly define the value in InfluxDB. But the automatic generation of
continuous queries is key; without doing that, I’d have to define a set of continuous queries every time I start to record
a new measurement.
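For the curious, this is roughly what such a data point looks like on the wire. The sketch below formats a single value in InfluxDB line protocol. The host name `web01`, the field name `value`, and the helper names are assumptions for illustration.

```python
import time
from typing import Dict, Optional


def _escape(text: str, specials: str) -> str:
    """Backslash-escape the characters line protocol treats as delimiters."""
    for ch in specials:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(
    measurement: str,
    tags: Dict[str, str],
    value: float,
    ts_ns: Optional[int] = None,
) -> str:
    """Build one line: measurement,tag=val,... value=<float> <timestamp in ns>."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    name = _escape(measurement, ", ")
    # Tags are sorted by key so the output is deterministic.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    return f"{name}{tag_part} value={float(value)} {ts_ns}"


line = to_line(
    "apache.requests_per_second",
    {"host": "web01"},
    42.5,
    ts_ns=1500000000000000000,
)
print(line)
# apache.requests_per_second,host=web01 value=42.5 1500000000000000000
```

Nothing about the measurement has to exist in InfluxDB beforehand; the first write creates it.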
So the question was what client software to install on our instances to collect the data and send it to InfluxDB. There
are lots of options like Telegraf, collectd, etc. We opted to take advantage of the streaming capabilities of netdata.
Not only does netdata provide a fantastic real-time view into an individual server’s performance, it can stream out
to another time series database. Not to mention, netdata is super lightweight, which is critical when you’re trying
to cost-optimize your cloud environment. You don’t want your monitoring software to be the reason you have to upgrade to
a larger instance size, costing you more money.
One copy of netdata can only send to one backend, but that
may change in the future. Until then, we get around this
by having two netdata executables running. The first instance gathers the metrics, provides a web based UI, streams to
InfluxDB, and relays the data to the second instance. The second instance only sends to InfluxDB. In this way, we can
send the data to both of our InfluxDB instances.
Netdata is very thorough; it captures hundreds of performance metrics. I believe we were getting about 700-800 individual
values with the default configuration. We looked at the cost of archiving all that data, and we decided we didn’t
need a lot of it. With some careful configuration, we were able to trim this down to about 300 metrics.
Netdata gets us host-level data from the Linux operating system running on our instances. But there are other kinds of
data that we needed to ingest and analyze.
Services like Aurora and ElastiCache don’t let us install our own client software on the hosts (nor do I want to!). So we
needed something to connect to those services and retrieve performance metrics. Telegraf handles this out of the box. We
had to do a lot of custom configuration to get just the set of MySQL metrics we wanted, but it wasn’t too bad.
In addition, there are lots of proprietary AWS services, like Elastic Load Balancing and CloudFront, that don’t have
standard metrics reporting that can be handled by Telegraf. For those, we wrote our own code to query CloudWatch
and send values to InfluxDB. There were a few things to watch out for:
- variable latency – the delay on CloudWatch values is different for every metric. For some metrics, you might have
numbers a few minutes behind real time. For others, it could be hours (EFS in particular). We had to build the software
with some configurability to handle this extreme variability (if we only looked back 10 minutes, for example, we’d never
get any EFS values)
- partial reporting – some CloudWatch values come in with preliminary numbers and later get updates. The value for a given data point
can change; the software had to be built to recognize these updated values and replace them in InfluxDB
- rate conversion – many CloudWatch metrics are discrete counts (like “Bytes Transferred” might mean “number of bytes transferred
since the last reading”). This is not very helpful for analysis for people like me who think in megabits per second
or gigabits per second. Also, because we are using auto-generated continuous queries in InfluxDB to reduce
data resolution, we need consistency between measurements. If you’re reporting a data rate, you can average
the values to reduce the resolution. But if you are reporting absolute counts per interval, you can’t average;
you would need a sum. We wanted all our continuous queries to perform the same averaging operation so that
they could be auto-generated. So it was really important to look very closely at each CloudWatch metric
we were gathering and determine if it needed a rate conversion.
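Two of the points above, the per-metric lookback and the rate conversion, can be sketched in a few lines. The lookback values below are invented for illustration, and the function names are hypothetical. The third point mostly takes care of itself if you re-send the whole window on every poll: InfluxDB overwrites a point that arrives with the same measurement, tag set, and timestamp, so an updated CloudWatch value simply replaces the preliminary one.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

# Hypothetical lookback windows per CloudWatch namespace. Slow reporters
# such as EFS need a much longer window than the default.
LOOKBACK: Dict[str, timedelta] = {
    "AWS/ELB": timedelta(minutes=15),
    "AWS/EFS": timedelta(hours=6),
}
DEFAULT_LOOKBACK = timedelta(minutes=30)


def query_window(namespace: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) time range to request for this namespace."""
    return now - LOOKBACK.get(namespace, DEFAULT_LOOKBACK), now


def count_to_rate(total: float, period_seconds: float) -> float:
    """Turn a count accumulated over one period into a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return total / period_seconds


def bytes_to_mbps(total_bytes: float, period_seconds: float) -> float:
    """Bytes transferred during one period, expressed as megabits per second."""
    return count_to_rate(total_bytes, period_seconds) * 8 / 1_000_000


# 75 MB transferred during a 60-second period is a steady 10 Mbit/s.
print(bytes_to_mbps(75_000_000, 60))  # 10.0
```

Once every series is a rate, averaging 10-second points into 1-minute points gives the right answer, which is what lets a single averaging continuous query work for every measurement.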
This is the really fun part of the whole thing. Seeing your systems described in beautiful interactive graphs really
makes them come to life, and gives you tons of insight into what is happening on those systems and what needs attention.
As mentioned before, netdata provides us a very nice real-time UI on port 19999 on each host. Graphs are updated once
a second, and they are interactive.
For looking at longer-term trends, rolling up multiple hosts into one dashboard, or looking at data from sources other
than netdata, we use Grafana. Its web UI runs on port 3000 by default. Its graphs are very interactive; you can zoom
in to look closely at one particular chart, you can drag to zoom in on a specific time range, and you can toggle specific
data points on and off. You can build highly sophisticated dashboards
and have them cycle automatically.
Here’s a screenshot of our video streaming dashboard during the August 2017 total eclipse.
I should mention that we use a fantastic CLI tool, wizzy, to
manage our dashboards. It imports/exports the dashboards in JSON format, perfect for committing to our git repo.
We rely on alerting to identify problems in the infrastructure. If we’re pushing the CPU too hard on a particular instance,
we may want to consider using a larger instance type or scaling horizontally to add more instances to handle the workload.
Given that we’re using InfluxDB, Kapacitor was a natural option for alerting. It streams values from InfluxDB as they
are ingested, applying lambda expressions to do things like filter by tag and combine values mathematically. It can
do anomaly detection, alerting when something deviates too much from the norm. Configuration is a little tricky,
requiring you to author “TICK scripts”. But I think this complexity is necessary. Alerts in other tools, like Zabbix, can
be quite complex to configure, and having the logic in a script lends itself better to version control than doing it inside
a web interface like in Zabbix.
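To make "deviates too much from the norm" concrete, here is the underlying idea in plain Python. This is not a TICK script and not how Kapacitor is implemented; it is only an illustration of measuring a new value's distance from recent history in standard deviations. The threshold of 3 and the function names are arbitrary choices for the example.

```python
from statistics import mean, pstdev
from typing import Sequence


def sigma_distance(history: Sequence[float], latest: float) -> float:
    """How many standard deviations `latest` sits from the mean of `history`."""
    if len(history) < 2:
        return 0.0  # not enough data to say anything
    spread = pstdev(history)
    if spread == 0:
        return 0.0  # perfectly flat history: this sketch declines to judge
    return abs(latest - mean(history)) / spread


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag a value that is more than `threshold` deviations from the norm."""
    return sigma_distance(history, latest) > threshold


cpu_history = [10, 12, 11, 9, 10, 12, 11, 9]
print(is_anomalous(cpu_history, 11))  # False
print(is_anomalous(cpu_history, 20))  # True
```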
In addition to being able to send email, Kapacitor integrates nicely with a wide variety of notification services,
like PagerDuty and VictorOps (the platform we’re using). It also has easy Slack integration, if you want to send a copy
of all your notifications to a Slack channel for your team to monitor.
To understand our costs, you have to know a little about our environment to get an idea of how much we’re collecting and storing.
- monitoring about 40-50 EC2 instances with netdata, with about 300 unique measurements
- netdata sends values to InfluxDB once every 10s
- retain 10s data for 10 days
- retain 1m data for 120 days
- retain 15m data for 2 years
- collecting CloudWatch metrics from about 10 AWS services, with multiple resources per service, and about 15 measurements per service
- collecting performance metrics from one ElastiCache cluster and two Aurora clusters (about 20-30 measurements each)
Here is what that costs us each month:
- two t2.large instances: $84.68/mo
- ELB for the grafana UI: $18.76/mo
- two 120GB EBS volumes for InfluxDB storage: $24.00/mo
- additional CloudWatch metrics and API calls: $38.00/mo
Total cost: $165.44/mo. I am happy with this cost; I don’t think any packaged services could touch this in terms of data
resolution, coverage, and retention. I hope this gives you some ideas on building your own monitoring stack in AWS!