Log Loss Detection
We spent hours debugging a logging issue Friday and Monday. If you use UUIDs in Perl, you should read this post.
I’ve been working on a logging overhaul at ZipRecruiter for more than six months. I intend to post a retrospective soon, but there are still enough unknowns that I don’t actually know whether we’re there yet.
One of the features of our logging system is built-in Log Loss Detection. Logs at ZipRecruiter are a JSON doc per line; logs that are not a JSON doc basically end up getting discarded. If an application wants log loss detection, the application needs to add two fields:
script_exec_id
script_exec_nth
script_exec_id is meant to be unique for each running application. There is nuance there that I will get into after defining script_exec_nth. script_exec_nth is meant to be a monotonically increasing integer, starting at 1 when the application starts.
There are basically two ways to implement this. For unthreaded languages (at ZR
this means Perl and Python) you generate a new script_exec_id
when your pid
changes due to a fork and reset script_exec_nth
to 1. For threaded languages
(at ZR this means Go and Java) you can have a single script_exec_id
and use an
atomic increment.
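To make that concrete, here’s a minimal sketch of the unthreaded version in Perl (illustrative only; the package name and the JSON and UUID modules are stand-ins, not our actual logger):
# Toy::Logger - illustrative fork-aware logger, not ZR's production code
package Toy::Logger;
use strict;
use warnings;
use JSON::PP ();
use Data::UUID::LibUUID qw(new_uuid_string);

my $pid             = $$;
my $script_exec_id  = new_uuid_string();
my $script_exec_nth = 0;

sub log_line {
    my (%fields) = @_;

    # A fork happened since the last log line: new id, restart the counter.
    if ($$ != $pid) {
        $pid             = $$;
        $script_exec_id  = new_uuid_string();
        $script_exec_nth = 0;
    }

    print JSON::PP::encode_json({
        %fields,
        script_exec_id  => $script_exec_id,
        script_exec_nth => ++$script_exec_nth,
    }), "\n";
}

1;
A threaded logger (Go or Java in our case) would instead keep a single script_exec_id and bump script_exec_nth with an atomic increment.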
The result is that you have a lot of log lines like this (only a few values shown here):
{"script_exec_id":"a","script_exec_nth":1,"message":"foo"}
{"script_exec_id":"b","script_exec_nth":1,"message":"bar"}
{"script_exec_id":"a","script_exec_nth":2,"message":"xyzzy"}
{"script_exec_id":"b","script_exec_nth":2,"message":"baz"}
{"script_exec_id":"a","script_exec_nth":3,"message":"biff"}
{"script_exec_id":"b","script_exec_nth":3,"message":"bong"}
Once the logs have gotten all the way to S3 we can use Athena to find log loss:
WITH a AS (
  SELECT
    json_extract_scalar(record, '$.script_exec_id') AS uuid,
    -- json_extract_scalar returns varchar, so cast before doing arithmetic
    CAST(json_extract_scalar(record, '$.script_exec_nth') AS bigint) AS nth
  FROM core.prod_unified_logs
  WHERE log_date = 20180725 AND log_hour = 1
)
SELECT
  uuid,
  MAX(nth) - MIN(nth) + 1              AS expected_count,
  COUNT(nth)                           AS actual_count,
  MAX(nth) - MIN(nth) + 1 - COUNT(nth) AS loss_count
FROM a
GROUP BY 1
HAVING MAX(nth) - MIN(nth) + 1 <> COUNT(nth)
This query will return a uuid, an expected_count, an actual_count, and a loss_count. The actual query is more complicated to assist debugging when this happens, but this is the gist of it.
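When the logs are still sitting on a host you can do the same arithmetic locally. Here’s a rough sketch of that check in Perl (my illustration; check_loss.pl is not part of our tooling):
# check_loss.pl - rough local equivalent of the Athena check (illustrative sketch)
# Usage: perl check_loss.pl whatever.log
use strict;
use warnings;
use JSON::PP qw(decode_json);

my %nth_for;
while (my $line = <>) {
    my $rec = eval { decode_json($line) } or next;   # non-JSON lines get discarded anyway
    next unless defined $rec->{script_exec_id} && defined $rec->{script_exec_nth};
    push @{ $nth_for{ $rec->{script_exec_id} } }, $rec->{script_exec_nth};
}

for my $id (sort keys %nth_for) {
    my @nth      = sort { $a <=> $b } @{ $nth_for{$id} };
    my $expected = $nth[-1] - $nth[0] + 1;           # MAX(nth) - MIN(nth) + 1
    my $actual   = scalar @nth;                      # COUNT(nth)
    printf "%s expected=%d actual=%d loss=%d\n",
        $id, $expected, $actual, $expected - $actual
        if $expected != $actual;
}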
The biggest issue we found when we rolled this out was totally ridiculous loss_counts: specifically, stratospherically high values or negative numbers. I dug into both situations by finding logs that still existed on the original hosts and examining them, just in case the loss actually occurred in the pipeline. They were both basically the same issue. Here’s how you get negative numbers (-3 in this case):
{"script_exec_id":"a","script_exec_nth":1,"message":"foo"}
{"script_exec_id":"a","script_exec_nth":1,"message":"bar"}
{"script_exec_id":"a","script_exec_nth":2,"message":"xyzzy"}
{"script_exec_id":"a","script_exec_nth":2,"message":"baz"}
{"script_exec_id":"a","script_exec_nth":3,"message":"biff"}
{"script_exec_id":"a","script_exec_nth":3,"message":"bong"}
So the max is 3, the min is 1, and the total is 6; 3 - 1 + 1 - 6 == -3.
The above happens if a process forks but for some reason the script_exec_id doesn’t change. By the same intuition, imagine a process that logs very slowly running at the same time as a process that logs quickly. You might see an hour window like this:
{"script_exec_id":"a","script_exec_nth":1000,"message":"fast logger"}
{"script_exec_id":"a","script_exec_nth":1001,"message":"fast logger"}
...
{"script_exec_id":"a","script_exec_nth":5000,"message":"fast logger"}
{"script_exec_id":"a","script_exec_nth":3,"message":"slow logger"}
The max is 5000, the min is 3, and the total is 4002 (the 4001 fast lines plus the single slow line); 5000 - 3 + 1 - 4002 == 996.
How did we get the duplicate UUIDs? We read the code carefully and found no leads. I spent a lot of time tracking down this bug. The short version is that sometimes glibc caches the result of getpid, and if you bypass the libc wrappers for syscalls like fork, the cache will not get blown. I tried to find patterns around this theory, failed, and ended up ignoring it over the weekend.
On Monday my coworker Jeff Moore and I discussed this, and it became clear that it couldn’t be getpid being cached: the pid was changing in other parts of the log line and the script_exec_nth was getting reset; only the script_exec_id was wrong.
Early on I checked to see that the UUID library we were using, Data::UUID, didn’t immediately produce duplicates:
$ perl -MData::UUID -E'$u = Data::UUID->new(); fork; say $u->create_str'
3171F728-9090-11E8-AFD5-0ADC68237998
3171FA5C-9090-11E8-9481-0ADC68237998
It didn’t, but after seeing such evidence, I had to try harder:
$ perl -MData::UUID -E'$u = Data::UUID->new(); my $parent = $$; $parent == $$ && fork for 1..shift; say $u->create_str' 1000 | sort | uniq -c | grep -v '^\s*1\s'
2 AE3B7A18-9090-11E8-95F3-3BEC68237998
2 AE426800-9090-11E8-AAFA-3BEC68237998
2 AE429604-9090-11E8-AFED-3BEC68237998
2 AE44C10E-9090-11E8-BEA1-3BEC68237998
2 AE44CF5A-9090-11E8-B866-3BEC68237998
2 AE458990-9090-11E8-A8ED-3BEC68237998
2 AE4C9D3E-9090-11E8-B8E3-3BEC68237998
2 AE4CC5D4-9090-11E8-BE7D-3BEC68237998
2 AE4CCFCA-9090-11E8-9268-3BEC68237998
Bingo. I was able to reproduce this on VMs in EC2 and on my laptop, and friends reproduced it on OS X and on Windows. I did a quick search, found Data::UUID::LibUUID, and verified that it did not have this issue with either of the supported UUID versions. After the code was deployed we verified that the issues above were resolved, and they were.
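If you want to repeat the stress test against the replacement, something like this should do it (a sketch assuming Data::UUID::LibUUID’s new_uuid_string export); no output means no duplicates:
$ perl -MData::UUID::LibUUID=new_uuid_string -E'my $parent = $$; $parent == $$ && fork for 1..shift; say new_uuid_string()' 1000 | sort | uniq -c | grep -v '^\s*1\s'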
I’ve thought about the implications of this bug quite a bit. Initially I
assumed that I simply hadn’t read the docs for Data::UUID
carefully enough.
Here’s a line from the official docs:
The algorithm provides reasonably efficient and reliable framework for generating UUIDs and supports fairly high allocation rates – 10 million per second per machine – and therefore is suitable for identifying both extremely short-lived and very persistent objects on a given system as well as across the network.
On top of that, Data::UUID and Data::GUID (which is based on the former) are incredibly popular and thus are de facto standards. In any case, I suggest that if your UUIDs need to be unique (and they almost surely do) you use Data::UUID::LibUUID, or maybe something else.
We are still iterating on the log loss found by this system. Most loss we have found so far is at the logger level, not the pipeline level. A really nice side effect of this system is that when you find yourself investigating log loss, you can run this simple one-liner to see the gaps:
$ diff <(grep -Fh $UUID whatever.log | jq .script_exec_nth | sort -n) <(seq 1 4578)
471a472
> 900
472a474,475
> 902
The above implies that the records for lines 900 and 902 are missing. Note that the need for sort -n implies a bug in our logging framework, which means we see some phantom loss that disappears when we increase the window of time we look at.
This log loss metric, while requiring a lot of work at many levels, is so valuable. It lets us detect that loss has occurred before the logs have rolled off the box, so we can pull the logs off (if they ever existed in the first place) and figure out what’s going wrong.
(The following includes affiliate links.)
The first book I’d suggest to dig deeper on this topic would be The SRE Book. It can be a bit “Googly” in that it goes into far too much detail in some places, with far too little in other places, but it can be a source for inspiration.
I don’t know what else will help you learn how to do this. I’m reading A Crack in Creation right now and it’s really interesting. This topic (gene editing) will certainly become more and more relevant.
Posted Wed, Jul 25, 2018