Prometheus cloudwatch-exporter Examples
Today I spent a few hours figuring out how to integrate Prometheus with AWS. If you use AWS and want to monitor your infrastructure, you are probably going to use CloudWatch. If you use Prometheus, you'll probably use cloudwatch-exporter. For the most part, cloudwatch-exporter works perfectly well and, as far as I can tell, exposes all the various bits we could ever want.
I am writing up a bunch of initial configuration for our cloudwatch-exporter so that teams can monitor and alert on their infrastructure. First off, a quick note about how CloudWatch pricing works: fundamentally, you will be making a bunch of GetMetricData requests, and these requests end up querying some number of metrics. The official cost from Amazon is:
$0.01/1,000 metrics requested using GetMetricData
This is a little abstract, but I'll explain it more as we go.
The first AWS service I wanted to add monitoring for is SQS; SQS is AWS' queue service, and it exposes a bunch of useful metrics. I looked over the list and our notes from a meeting on how we should monitor SQS, and came up with the following configuration for cloudwatch-exporter:
region: us-west-2
metrics:
  - aws_namespace: AWS/SQS
    aws_metric_name: ApproximateAgeOfOldestMessage
    aws_dimensions: [QueueName]
    aws_statistics: [Maximum]
  - aws_namespace: AWS/SQS
    aws_metric_name: NumberOfMessagesReceived
    aws_dimensions: [QueueName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/SQS
    aws_metric_name: NumberOfMessagesDeleted
    aws_dimensions: [QueueName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/SQS
    aws_metric_name: NumberOfMessagesSent
    aws_dimensions: [QueueName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/SQS
    aws_metric_name: ApproximateNumberOfMessagesVisible
    aws_dimensions: [QueueName]
    aws_statistics: [Sum]
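For context, here's roughly how this gets wired up. Treat it as a minimal sketch: the jar name, the config file name, and the scrape interval are placeholders, 9106 is the exporter's default port, and the aws_sqs job name just matches the samples below.
# Run the exporter (a Java app) against the config above, something like:
#   java -jar cloudwatch_exporter.jar 9106 cloudwatch.yml
# Then point Prometheus at it:
scrape_configs:
  - job_name: aws_sqs
    # CloudWatch data arrives with a delay, and every scrape triggers
    # CloudWatch API calls (which is what you're billed for), so there's
    # no point scraping aggressively.
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9106']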
Here's a sample of the metrics the above configuration exports (note that the exporter appends the statistic, e.g. _maximum or _sum, to each metric name and turns each dimension into a label):
aws_sqs_approximate_age_of_oldest_message_maximum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_approximate_number_of_messages_visible_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_received_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_sent_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_deleted_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
The configuration above requests five metrics per queue. If you have six SQS queues and you scrape them every minute, that works out to $12.96 a month:
6 queues * 5 metrics * 60 minutes * 24 hours * 30 days * $0.01 / 1,000 = $12.96
We'll be providing the above metrics (and many more for other AWS services) along with a guide of example alert rules that teams can use to monitor their infrastructure correctly. Here's an obvious rule to detect when the consumer of a queue is down:
aws_sqs_approximate_age_of_oldest_message_maximum{queue_name="my-awesome-queue"} > 15 * 60
The above would alert if an item sat in your queue for over 15 minutes. Tune to your SLA.
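For reference, dropping that expression into a Prometheus rule file looks something like this; the alert name, the for: delay, and the labels are placeholders to adapt to your own setup:
groups:
  - name: sqs
    rules:
      - alert: SQSConsumerDown
        # The oldest message has been sitting in the queue for over 15 minutes,
        # which suggests nothing is consuming it.
        expr: aws_sqs_approximate_age_of_oldest_message_maximum{queue_name="my-awesome-queue"} > 15 * 60
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "my-awesome-queue has not been consumed for 15+ minutes"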
Another common failure mode is that the queue processor suddenly can't keep up. This needs some care to configure, but here's one way you could express it:
delta(aws_sqs_approximate_number_of_messages_visible_sum{queue_name="my-cool-queue"}[30m]) > 0
I’m not thrilled with the above, since a huge spike in inbound messages could make it fire, but I intend to experiment and bounce ideas off coworkers.
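One alternative I may experiment with (untested, and the 30-minute window is arbitrary) is comparing inflow to outflow directly, since NumberOfMessagesSent and NumberOfMessagesDeleted are already being exported above; if sends outpace deletes over the window, the backlog is growing:
sum_over_time(aws_sqs_number_of_messages_sent_sum{queue_name="my-cool-queue"}[30m]) > sum_over_time(aws_sqs_number_of_messages_deleted_sum{queue_name="my-cool-queue"}[30m])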
As a side note, I think it's important to mention that the official cloudwatch-exporter has some performance issues due to being implemented in terms of GetMetricStatistics instead of GetMetricData. The price (if I'm reading correctly) should be the same, but it is, at least for us, pretty slow.
(The following includes affiliate links.)
If you want to learn more about Prometheus, you might check out Prometheus: Up & Running. Another option is Monitoring with Prometheus. I've only spent a little time glancing at these two books so far, but both of them have good stuff in them.
Posted Mon, Apr 15, 2019