Prometheus cloudwatch-exporter Examples

Today I spent a few hours figuring out how to integrate Prometheus with AWS.

If you use AWS and want to monitor your infrastructure, you are probably going to use CloudWatch. If you use Prometheus, you’ll probably use cloudwatch-exporter. For the most part, cloudwatch-exporter works perfectly well and, as far as I can tell, exposes all the various bits we could ever want.

I am writing up a bunch of initial configuration for our cloudwatch-exporter so that teams can monitor and alert on their infrastructure. First off, a quick note about how CloudWatch pricing works: you will fundamentally be making a bunch of GetMetricData requests, and each of those requests queries some number of metrics. The official cost from Amazon is:

$0.01/1,000 metrics requested using GetMetricData

This is a little abstract, but I’ll explain it more as we go.

The first AWS service I wanted to add monitoring on is SQS; SQS is AWS’ queue service and it exposes a bunch of useful metrics. I looked over the list and our notes from a meeting on how we should monitor SQS and came up with the following configuration for cloudwatch-exporter:

region: us-west-2
metrics:

 - aws_namespace: AWS/SQS
   aws_metric_name: ApproximateAgeOfOldestMessage
   aws_dimensions: [QueueName]
   aws_statistics: [Maximum]

 - aws_namespace: AWS/SQS
   aws_metric_name: NumberOfMessagesReceived
   aws_dimensions: [QueueName]
   aws_statistics: [Sum]

 - aws_namespace: AWS/SQS
   aws_metric_name: NumberOfMessagesDeleted
   aws_dimensions: [QueueName]
   aws_statistics: [Sum]

 - aws_namespace: AWS/SQS
   aws_metric_name: NumberOfMessagesSent
   aws_dimensions: [QueueName]
   aws_statistics: [Sum]

 - aws_namespace: AWS/SQS
   aws_metric_name: ApproximateNumberOfMessagesVisible
   aws_dimensions: [QueueName]
   aws_statistics: [Sum]

Here’s a sample of the exported metrics above:

aws_sqs_approximate_age_of_oldest_message_maximum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_approximate_number_of_messages_visible_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_received_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_sent_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
aws_sqs_number_of_messages_deleted_sum{job="aws_sqs",instance="",queue_name="utility",} 0.0 1555294800000
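
The exported names follow a visible pattern: namespace, metric name, and statistic, each converted to snake_case and joined with underscores. Here is a minimal Python sketch of that pattern. The helper names `to_snake_case` and `exported_name` are my own, and this only mirrors what the sample shows; it is not the exporter's actual code, and names containing acronyms may well come out differently.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a name such as 'AWS/SQS' or 'NumberOfMessagesSent' to snake_case."""
    # Insert an underscore at every lowercase-or-digit to uppercase boundary.
    with_boundaries = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Collapse anything that is not a letter or digit (for example '/') into '_'.
    return re.sub(r"[^A-Za-z0-9]+", "_", with_boundaries).lower()


def exported_name(namespace: str, metric_name: str, statistic: str) -> str:
    """Guess the Prometheus series name for one namespace/metric/statistic triple."""
    parts = (namespace, metric_name, statistic)
    return "_".join(to_snake_case(part) for part in parts)


# Metric and statistic pairs copied from the configuration above.
CONFIGURED = [
    ("ApproximateAgeOfOldestMessage", "Maximum"),
    ("NumberOfMessagesReceived", "Sum"),
    ("NumberOfMessagesDeleted", "Sum"),
    ("NumberOfMessagesSent", "Sum"),
    ("ApproximateNumberOfMessagesVisible", "Sum"),
]

for metric_name, statistic in CONFIGURED:
    print(exported_name("AWS/SQS", metric_name, statistic))
```

The practical takeaway is that the statistic ends up as a suffix on the series name, so any alert rule has to include it.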

The above is five metrics. If you have six SQS queues and you read them every minute, that would cost $12.96 per 30-day month:

6 * 5 * 24 * 60 * 30 * 0.01 / 1000
12.96
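
The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The function name `monthly_cost` and its parameters are made up for illustration; the price constant is the GetMetricData figure quoted above, and I am assuming a 30-day month with no free tier applied.

```python
PRICE_PER_1000_METRICS = 0.01  # USD, the GetMetricData price quoted above


def monthly_cost(queues: int, metrics_per_queue: int,
                 scrapes_per_hour: int = 60, days: int = 30) -> float:
    """Estimate the monthly CloudWatch bill for scraping SQS metrics."""
    # Every scrape requests one data point per metric per queue.
    metrics_requested = queues * metrics_per_queue * scrapes_per_hour * 24 * days
    return metrics_requested * PRICE_PER_1000_METRICS / 1000


# Six queues, five metrics each, scraped once a minute.
print(f"{monthly_cost(6, 5):.2f}")  # 12.96

# The same queues scraped every five minutes instead.
print(f"{monthly_cost(6, 5, scrapes_per_hour=12):.2f}")  # 2.59
```

Cost scales linearly with queues, metrics, and scrape frequency, so the scrape interval is the easiest lever if the bill gets uncomfortable.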

We’ll be providing the above metrics (and many more for other stuff in AWS) along with a guide of example alert rules that teams can use to correctly monitor their infrastructure. Here’s an obvious rule to detect when the consumer of a queue is down:

aws_sqs_approximate_age_of_oldest_message_maximum{queue_name="my-awesome-queue"} > 15 * 60

The above would alert if an item sat in your queue for over 15 minutes. Tune to your SLA.

Another common failure mode is that suddenly the queue processor can’t keep up. This needs some care configuring, but here’s one way you could express it:

delta(aws_sqs_approximate_number_of_messages_visible_sum{queue_name="my-cool-queue"}[30m]) > 0

I’m not thrilled with the above, since a huge spike in inbound messages could make it fire, but I intend to experiment and bounce ideas off coworkers.
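
To make the spike concern concrete, here is a small Python sketch with made-up queue depths. It approximates delta() as the last sample minus the first sample in the window, which ignores the extrapolation Prometheus actually does, so treat it as an illustration of the rule's shape and nothing more. All names and numbers are hypothetical.

```python
def simple_delta(samples: list[float]) -> float:
    """Last sample minus first sample; a rough stand-in for PromQL delta()."""
    return samples[-1] - samples[0]


def would_fire(samples: list[float]) -> bool:
    """Mirror the rule above: fire when the queue is deeper than at the window start."""
    return simple_delta(samples) > 0


# Hypothetical queue depths, one sample every five minutes across a 30 minute window.
consumer_stuck = [10, 40, 80, 120, 170, 210, 260]  # backlog keeps growing
inbound_spike = [0, 0, 0, 0, 500, 300, 100]        # burst arrives and is being drained
healthy = [5, 3, 4, 2, 5, 3, 4]                    # depth hovers around a few messages

print(would_fire(consumer_stuck))  # True
print(would_fire(inbound_spike))   # True, even though the consumer is catching up
print(would_fire(healthy))         # False
```

The second case is the false positive: the rule only compares the two ends of the window, so it cannot tell a draining burst from a stuck consumer.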

As a side note, the official cloudwatch-exporter has some performance issues because it is implemented in terms of GetMetricStatistics instead of GetMetricData. The price (if I’m reading correctly) should be the same, but it is, at least for us, pretty slow.
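
A rough way to see where the slowness comes from is to count API round trips. GetMetricStatistics fetches one metric per call, while GetMetricData can, as I understand the AWS documentation, batch up to 500 metrics into a single request. The sketch below only counts calls; it does not talk to AWS, the function names are my own, and the 500 limit is worth checking against the current docs.

```python
import math

# Assumed per-request batch limit for GetMetricData; verify against the AWS docs.
MAX_METRICS_PER_GET_METRIC_DATA = 500


def get_metric_statistics_calls(series_count: int) -> int:
    """One call per metric and dimension combination."""
    return series_count


def get_metric_data_calls(series_count: int,
                          batch_size: int = MAX_METRICS_PER_GET_METRIC_DATA) -> int:
    """Metrics are batched, so calls grow in steps of batch_size."""
    return math.ceil(series_count / batch_size)


# Six queues with five metrics each, then a hypothetical 600 queues.
for queues in (6, 600):
    series = queues * 5
    print(series, get_metric_statistics_calls(series), get_metric_data_calls(series))
# prints "30 30 1" and then "3000 3000 6"
```

With the per-metric API, every scrape pays the latency of one request per series, which adds up quickly even when the bill stays the same.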

(The following includes affiliate links.)

If you want to learn more about Prometheus, you might check out Prometheus: Up & Running.

Another option, which I have only glanced at so far, is Monitoring with Prometheus.

I have only spent a little time glancing at these two books and both of them have good stuff in them.

Posted Mon, Apr 15, 2019
