Monitoring Service start/stop in Upstart
Recently at ZipRecruiter I implemented a tool to ensure that we know if a service is crashlooping. Thanks to Upstart it was really easy, though it still took almost a whole day to get just right.
I called the tool cerberus; you could call it anything though. Fundamentally it watches all starting or stopping events that go through Upstart, excluding itself. Here's the guts of the implementation:
In /etc/init/cerberus.conf:
description "Monitor service events"
start on (starting JOB!="cerberus" JOB!="startpar-bridge" INSTANCE!="cerberus*") \
or (stopping JOB!="cerberus" JOB!="startpar-bridge" INSTANCE!="cerberus*")
exec perl -E 'say localtime . qq( $ARGV[0] $_) for split /\s+/, $ARGV[1]' "$JOB" "$UPSTART_EVENTS"
This would log something like Sun Sep 24 19:45:00 2017 www stopping.
We are actually sending this data to our stats server so we can build monitors based on it, but that's the basic idea. A frustrating side note is that, as a feature, Upstart only lets a single process run for a given job. When we initially did this I used the events started and stopped instead of starting and stopping. The upshot was that if you ran restart www it would only record the first event, because the second one couldn't start, as it happened at the same time. For some reason (I really am not sure why) the starting/stopping version doesn't have that problem.
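As for shipping the events to stats rather than a log, here is roughly what that might look like in place of the exec line above. It's only a sketch: the statsd-style collector, the address and port, and the metric naming are assumptions, not what we actually run.

script
  # emit one counter per event to a hypothetical statsd-style UDP listener,
  # e.g. upstart.events.www.stopping:1|c
  perl -MIO::Socket::INET -E '
    my $sock = IO::Socket::INET->new(
      Proto => q(udp), PeerAddr => q(localhost), PeerPort => 8125,
    ) or die $!;
    $sock->send(qq(upstart.events.$ARGV[0].$_:1|c)) for split /\s+/, $ARGV[1];
  ' "$JOB" "$UPSTART_EVENTS"
end script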
Additionally, there is still a race condition. If two services advertise that they are either starting or stopping at the same time, we will only hear about one of them. Because I made this to monitor crashlooping I am not too worried about that. If you wanted to be 100% confident you got all the events you could build a watchdog per service, but that sounds easy to get wrong to me.
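If you did go that route, a per-service watcher might look something like the following, one file per service. This is a minimal sketch for a hypothetical service named www, and I haven't battle-tested it.

In /etc/init/watch-www.conf:

description "Record www start/stop events"
# a dedicated job per service means simultaneous events from two services
# each hit their own watcher, so nothing is dropped by a shared job
start on (starting JOB="www") or (stopping JOB="www")
exec perl -E 'say localtime . qq( www $_) for split /\s+/, $ARGV[0]' "$UPSTART_EVENTS"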
This change allowed us to remove the limit on respawning and configure all services to respawn forever. Worst case there would be a syntax error and we'd get alerted. More likely, if some other service goes down (like s3 or something) our service will restart fifty times and then come back.
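Concretely, that just means the service stanzas can say something like this, assuming an Upstart version that understands respawn limit unlimited; the file name is only an example.

In /etc/init/www.conf:

# restart the job whenever it exits unexpectedly, and never give up;
# cerberus plus the alerting built on it is what tells us it's flapping
respawn
respawn limit unlimited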
Somewhat comically, if a service is crashing over and over, you get multiple stopping events but no starting events. I think this is because a respawn is technically not one of the events. Annoying.
I am sad to say that I haven't found a great resource on how to monitor effectively. The main thing I can say is that if you are willing to think about it for an hour or so, you might be able to come up with an alert that is less likely to trigger spuriously but also gives you more time to react.
The alert discussed in this post could be expressed such that any time a service exits non-zero you get an alert. That would be the worst.
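A rate-based check is one way to do better. Here is a small sketch of the idea: count recent stopping lines per job and only complain past a threshold. It assumes you feed it just the last few minutes of cerberus output (via your stats pipeline, or cron and tail, or whatever), and the threshold is made up.

#!/usr/bin/perl
use strict;
use warnings;

# more than this many stops in the window smells like a crashloop
my $threshold = 10;
my %stops;

while (<STDIN>) {
    # lines look like: Sun Sep 24 19:45:00 2017 www stopping
    $stops{$1}++ if /(\S+) stopping$/;
}

for my $job (sort keys %stops) {
    print "ALERT: $job stopped $stops{$job} times recently\n"
        if $stops{$job} > $threshold;
}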
(The following includes affiliate links.)
The SRE Book discusses some of this, though it dedicates an absurd amount of time to an internal monitoring tool that is, as far as I understand it, going away.
I also suspect that Brendan Gregg’s Systems Performance would be worth reading. Good alerting requires good collection and, later, good analysis. This book can help with some of that.
Posted Mon, Sep 25, 2017