AWS Adventures, Part 4 – CloudWatch network monitoring

CloudWatch is a great concept — super-easy to configure and inexpensive. And at first glance, it actually looks pretty nice. But after I spent about 30 minutes with it, I realized it wasn’t easy to use. The units used are especially hard to interpret. This is my best attempt to explain what the network values mean.

Start by looking at per-instance network values. In the AWS console, go to CloudWatch. Type “ec2” in the metric search box, and click on “EC2 > Per-Instance Metrics”. In the Search box, paste in the instance ID of one of your EC2 instances, and then select “NetworkOut”.
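
If you'd rather script this than click through the console, here's a minimal boto3 sketch that pulls the same per-instance NetworkOut numbers. The instance ID is a placeholder:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
        StartTime=start,
        EndTime=end,
        Period=300,            # 5 minutes, matching basic monitoring
        Statistics=["Sum"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"], point["Unit"])  # Unit is "Bytes"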

The values CloudWatch records for NetworkIn and NetworkOut are just the number of bytes transferred since the last reading. Looking at network traffic in units of “Bytes” (as opposed to “Bits per second” or even “Bytes per second”) is a little like saying “the car went 1 mile” and expecting the listener to understand how fast it was going.

First, you have to know the interval between readings. For EC2, CloudWatch records metrics once every 5 minutes by default unless you pay for detailed metrics. Other services follow their own schedules: EFS records every minute, for example, while S3 reports basic metrics only once per day by default. You really need to do a deep dive into each metric's documentation to understand its frequency, which is another reason CloudWatch is hard to use.

So if you see a value of 300,000,000 Bytes for NetworkOut (without detailed metrics enabled), you have to divide by 300 (the number of seconds in 5 minutes) to get a Bytes-per-second value of 1,000,000. If you want a nice familiar metric like Mbps, you then multiply by 8 and divide by (1024 * 1024), for a value of about 7.6 Mbps.
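
To make that concrete, here's the arithmetic as a small Python helper. The period parameter covers both the basic (300-second) and detailed (60-second) cases; the 1024 * 1024 divisor mirrors the calculation above, so swap in 1,000,000 if you prefer SI megabits:

    def bytes_reading_to_mbps(total_bytes, period_seconds):
        # Convert a raw CloudWatch "Bytes since last reading" value to Mbps.
        bytes_per_second = total_bytes / period_seconds
        return bytes_per_second * 8 / (1024 * 1024)

    print(bytes_reading_to_mbps(300_000_000, 300))  # basic monitoring: ~7.63 Mbps
    print(bytes_reading_to_mbps(60_000_000, 60))    # detailed monitoring: same rate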

If you enable detailed metrics, you’ll get one value each minute, so the math changes a little bit. Instead of dividing by 300, you divide by 60. But you should note that the CloudWatch graph by default plots one point for every 5 minutes, unless you change that to 1 minute under the “Graphed Metrics” tab.

So what do the different “Statistic” types do? If your graph is set up so that you have one underlying data point for every point on the plot, then they don’t really do much of anything. They really only matter when you have more than one data point factored into each point on the plot. For example, if you have detailed metrics enabled, but your graph is configured with a period of 5 minutes, you have 5 data points being factored into each point in the plot. If you choose “Average”, the plotted point will be the average of the 5 data points. If you choose “Maximum”, the plotted point will be the highest value of the 5 data points. “Sum” will be the sum of the 5 data points (which would effectively be the same as if you had basic metrics measuring every 5 minutes and plotting with a period of 5 minutes).
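
Here's a toy illustration, with made-up one-minute readings, of how the three statistics roll 5 detailed data points into one 5-minute plotted point:

    # Five 1-minute NetworkOut readings (Bytes) rolled into one 5-minute point.
    minute_readings = [55_000_000, 62_000_000, 48_000_000, 71_000_000, 64_000_000]

    print("Average:", sum(minute_readings) / len(minute_readings))  # 60,000,000
    print("Maximum:", max(minute_readings))                         # 71,000,000
    print("Sum:    ", sum(minute_readings))                         # 300,000,000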

Things get more interesting if you’re looking at metrics across a group of instances. If you have an auto-scaling group with more than one EC2 instance, you can aggregate metrics across those instances. In the AWS console, go to CloudWatch. Type “ec2” in the metric search box, and click on “EC2 > By Auto Scaling Group”. In the Search box, paste in the name of the auto-scaling group, and then select “NetworkOut”.
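
This too can be scripted. The only change from the per-instance query above is the dimension, as in this sketch (the group name is made up):

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],  # placeholder
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum", "Sum"],  # compare all three at once
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"], point["Sum"])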

Now that you’re looking at numbers for more than one instance, the statistic selected makes a big difference.

If you have basic metrics, and you’re plotting one value every 5 minutes, the “Average” statistic will average the data points across the instances in your auto-scaling group. The “Maximum” statistic will give you the network volume for the instance that moved the most bytes. “Sum” will total up all the traffic across the instances, giving you a measure of the total volume of traffic handled by the auto-scaling group.

If you have detailed metrics, things get a little more complicated. With a 5-minute graph period, each instance contributes 5 one-minute data points. “Average” is the average across all of those data points from all of the instances in the auto-scaling group. “Maximum” is the single largest one-minute data point recorded on any one instance during the 5-minute period. “Sum” is the sum of all the data points from all of the instances.

Imagine you have 4 EC2 instances in your auto-scaling group. Depending on the metrics you’re using and the period of your graph, the number of data points taken into account for each point on the graph will vary.

Metrics             Graph period   Data points per plotted point
Basic (5-min)       5 min          4
Detailed (1-min)    5 min          20
Detailed (1-min)    1 min          4
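
A quick sanity check of the table: the count is just the number of instances multiplied by the graph period divided by the metric period:

    def points_per_plot(instances, graph_period_s, metric_period_s):
        # Data points feeding each plotted point on the CloudWatch graph.
        return instances * (graph_period_s // metric_period_s)

    print(points_per_plot(4, 300, 300))  # basic, 5-min graph period -> 4
    print(points_per_plot(4, 300, 60))   # detailed, 5-min graph period -> 20
    print(points_per_plot(4, 60, 60))    # detailed, 1-min graph period -> 4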

While this isn’t too bad once you get the hang of it, the fact remains that the network unit used by CloudWatch is the Byte. I’m not sure I’ll ever be able to glance at a byte count (with its implicit measurement interval) and determine whether an instance is moving “a lot” of traffic. I think in terms of Mbps. I know that an instance moving 800 Mbps is handling a pretty nice amount of traffic. But converting a value of “Bytes” to Mbps is too much mental math, especially when one instance might be using Basic Monitoring and another might be using Detailed Monitoring.

You might want to look to third-party solutions to really get the most out of your CloudWatch data. I looked at a number of different options and settled on running our own InfluxDB database with Grafana as the front end. We use custom code to query CloudWatch and feed the values into InfluxDB, converting the “units-since-last-measured” metrics into sensible “units-per-second”. With the InfluxDB/Grafana combination, this conversion is vital so that you can do things like take the mean of several values when rolling your data up into long-term retention policies.
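
For illustration, here is a stripped-down sketch of the kind of glue code I'm describing. It is not our production code: the host, database, measurement, and instance names are placeholders, and it assumes the influxdb Python client:

    import boto3
    from datetime import datetime, timedelta, timezone
    from influxdb import InfluxDBClient  # pip install influxdb

    PERIOD = 300  # seconds; use 60 for instances with detailed monitoring

    cloudwatch = boto3.client("cloudwatch")
    influx = InfluxDBClient(host="localhost", database="metrics")  # placeholder

    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=PERIOD,
        Statistics=["Sum"],
    )

    points = [
        {
            "measurement": "network_out",
            "tags": {"instance_id": "i-0123456789abcdef0"},
            "time": point["Timestamp"].isoformat(),
            # The key conversion: store a rate, not a raw per-period byte
            # count, so later means and rollups stay meaningful.
            "fields": {"bytes_per_second": point["Sum"] / PERIOD},
        }
        for point in response["Datapoints"]
    ]
    influx.write_points(points)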
