Monitoring Service start/stop in Upstart
Recently at ZipRecruiter I implemented a tool to ensure that we know if some service is crashlooping. It was really easy thanks to Upstart but it took almost a whole day to get just right.
I called the tool cerberus; you could call it anything though. Fundamentally it
logs all starting and stopping events that go through Upstart, excluding
itself. Here’s the guts of the implementation:
    description "Monitor service events"

    start on (starting JOB!="cerberus" JOB!="startpar-bridge" INSTANCE!="cerberus*") \
          or (stopping JOB!="cerberus" JOB!="startpar-bridge" INSTANCE!="cerberus*")

    exec perl -E 'say localtime . qq( $ARGV[0] $_) for split /\s+/, $ARGV[1]' "$JOB" "$UPSTART_EVENTS"
This would log something like
Sun Sep 24 19:45:00 2017 www stopping.
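If you want to do anything with these lines downstream, the first step is pulling them apart. Here is a minimal Python sketch of that, assuming exactly the line format shown above (Perl's scalar localtime, then the job name, then the event). The names LogEvent and parse_line are made up for illustration and are not part of cerberus.

```python
from datetime import datetime
from typing import NamedTuple


class LogEvent(NamedTuple):
    when: datetime
    job: str
    event: str


def parse_line(line: str) -> LogEvent:
    """Parse a line like 'Sun Sep 24 19:45:00 2017 www stopping'."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected log line: {line!r}")
    # The first five fields are the timestamp. Splitting on whitespace also
    # copes with the space-padded day of month ("Sep  3").
    when = datetime.strptime(" ".join(fields[:5]), "%a %b %d %H:%M:%S %Y")
    # Assumes job names contain no whitespace.
    return LogEvent(when, fields[5], fields[6])
```

Calling parse_line on the example line gives back the timestamp as a datetime plus the job www and the event stopping; anything that doesn't have exactly seven whitespace-separated fields raises ValueError rather than being guessed at.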
We are actually sending this data to our stats server so we can build monitors
based on it, but that’s the basic idea. A frustrating side note is that, as a
feature, Upstart only lets a single process run for a given job. When we
initially did this I used the event stopped instead of stopping. The upshot was
that if you ran restart www it would only record the first event, because the
second one couldn’t start as it happened at the same time. For some reason (I
really am not sure why) the stopping version doesn’t have that problem.
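The post doesn't say what the stats server speaks, so treat this purely as an illustration: if it accepted StatsD-style counters over UDP, shipping one event might look roughly like the sketch below. The metric prefix, host, and port are all assumptions, and format_counter and send_event are hypothetical names.

```python
import socket


def format_counter(job: str, event: str, prefix: str = "upstart") -> bytes:
    """Build a StatsD counter increment such as b'upstart.www.stopping:1|c'."""
    # Assumes job names are safe to use in a metric name (no ':' or '|').
    return f"{prefix}.{job}.{event}:1|c".encode("ascii")


def send_event(job: str, event: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a single counter over UDP (host and port are assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(job, event), (host, port))
```

UDP fits here because the logger has to exit quickly: the faster the job finishes, the smaller the window in which a second event can arrive while it is still running.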
Additionally there is still a race condition. If two services advertise that
they are either starting or stopping at the same time, we will only hear
about one of them. Because I made this to monitor crashlooping I am not too
worried about that. If you wanted to be 100% confident you got all the events
you could build a watchdog per service, but that sounds easy to get wrong to me.
This change allowed us to remove the limit on respawning and configure all
services to respawn forever. Worst case there would be a syntax error and we’d
get alerted. More likely, if some other service goes down (like
something) our service will restart fifty times and then come back.
Somewhat comically, if a service is crashing over and over, you get multiple
stopping events, but no starting events. I think this is because a respawn is
not technically one of the events. Annoying.
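Since a crashloop shows up as a burst of stopping events, one way to detect it is to count stopping events per job inside a sliding time window. The following is a minimal sketch, assuming the events have already been parsed into (timestamp, job, event) tuples sorted by time. The threshold of five stops within one minute and the name find_crashloops are illustrative choices, not anything from cerberus.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_crashloops(events, threshold=5, window=timedelta(minutes=1)):
    """Return the set of jobs with >= threshold 'stopping' events inside window.

    events: iterable of (timestamp, job, event_name), sorted by timestamp.
    """
    recent = defaultdict(deque)  # job -> timestamps of recent stopping events
    flagged = set()
    for when, job, name in events:
        # Respawns produce no starting events, so only stopping is counted.
        if name != "stopping":
            continue
        stops = recent[job]
        stops.append(when)
        # Drop stops that have fallen out of the window.
        while when - stops[0] > window:
            stops.popleft()
        if len(stops) >= threshold:
            flagged.add(job)
    return flagged
```

With these defaults a single deliberate restart never trips the check, while a service that dies every ten seconds is flagged after its fifth stop.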
I am sad to say that I haven’t found a great resource for how to monitor effectively. The main thing that I can say is that if you are willing to think for an hour or so you might be able to come up with an alert that is less likely to trigger spuriously but also gives you more time to react.
The alert discussed in this post could be expressed such that any time a service exits non-zero you get an alert. That would be the worst.
The SRE Book discusses some of this, though it dedicates an absurd amount of time to an internal monitoring tool that is, as far as I understand it, going away.
I also suspect that Brendan Gregg’s Systems Performance would be worth reading. Good alerting requires good collection and, later, good analysis. This book can help with some of that.

Posted Mon, Sep 25, 2017