Investigation: Why is SQS so slow?
Recently I spent time figuring out why sending items to our message queue often took absurdly long. I am really pleased with both my solutions and my methodology; maybe you will be too.
At ZipRecruiter we use AWS SQS for our message queue. As I suspect is typical, we use message queues to avoid talking to external services directly from web workers. The reason, which I have written obliquely about before, is that external services inevitably get slow, go down, or whatever, and end up causing your web workers to be completely saturated, blocking on said external service. I’ve seen it happen with SMTP, REST APIs, and even foundational backends like databases. When possible, not using the backing service from the web worker is the best option, and adding something to a message queue that will allow a batch process to talk to the service is a good way to make that happen.
So we recently did some work to migrate a large chunk of SMTP traffic to SQS. When the work was done, an incredible number of requests started taking a really long time (22s). One morning the CTO, Craig Ogg, asked me if I’d be willing to take a look.
I’ve worked on our SQS code before, specifically when I needed to add IAM EC2 role support. Our SQS module is straightforward; there is a single _request method and a boatload of wrappers for each API call. The _request call is simple: it builds up the HTTP::Request object, signs it, pulls it back apart, and hands it off to the UserAgent to perform. The only reason we don’t give the request object directly to a UserAgent is that this code is written to be as fast as possible and skips some abstraction for performance.
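In outline the method looks something like this. This is a sketch from memory rather than the real code; _endpoint_uri, _sign_request, and the stored Furl::HTTP object are stand-ins:

use 5.20.0;
use warnings;
use experimental 'signatures';
use HTTP::Request;
use Furl::HTTP;

sub _request ($self, $action, %params) {
    # build the request as an object, purely as a convenient container
    my $req = HTTP::Request->new(
        GET => $self->_endpoint_uri( Action => $action, %params ),
    );
    $self->_sign_request($req);    # stand-in for the SigV4 signing step

    # pull the object back apart and hand the pieces to Furl::HTTP,
    # skipping a layer of abstraction for speed
    my ( undef, $code, $msg, $headers, $body ) = $self->{ua}->request(
        method  => $req->method,
        url     => $req->uri,
        headers => [ map { $_ => scalar $req->header($_) }
                         $req->headers->header_field_names ],
    );

    die "SQS request failed: $code $msg" if $code != 200;
    return $body;
}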
All that said: after a careful reading of our code I saw no obvious problems. Here is the timeline of interesting bits:
🔗 Logging
The SQS library is generic enough to go to CPAN, so stuff like logging was delegated to callers. I decided to add logging directly to the library so that even if a caller ignored a class of exceptions we would still know. After the logging was live for less than ten hours we had some (previously unknown) details. Exactly three errors:
- Broken Pipe
- Connection timed out
- Connection reset by peer
They all came from a single callsite in Furl, the UserAgent the code uses.
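Mechanically the change was just wrapping the request inside the library and logging before rethrowing, so a caller that ignores a class of exceptions no longer hides it from us. Something like this (a sketch; where the logger comes from doesn’t matter here):

sub _request_with_logging {
    my ($self, @args) = @_;

    my @ret = eval { $self->_request(@args) };
    if ( my $err = $@ ) {
        # log inside the library itself, so a caller that swallows the
        # exception no longer hides the failure from us
        $self->{logger}->error("SQS request failed: $err")
            if $self->{logger};
        die $err;    # rethrow; callers see exactly what they saw before
    }
    return @ret;
}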
🔗 TCP Tuning
A couple of the errors above could be explained by NAT forgetting about long-lived sockets. The reason we have to go through NAT is boring and pointless, but we successfully reproduced the problem by sending an SQS message, sleeping six minutes, and then trying to send another message. The fix was to tell the kernel to send TCP Keepalive packets more often:
sysctl -w net.ipv4.tcp_keepalive_time=250   # first keepalive probe after 250s idle (default is 7200)
sysctl -w net.ipv4.tcp_keepalive_intvl=75   # and again every 75s after that
And to ensure that keepalive (SO_KEEPALIVE) was turned on for the sockets returned by the connect method in Furl::HTTP:
use Socket qw(SOL_SOCKET SO_KEEPALIVE);

# $sock is the socket Furl::HTTP's connect method just created
setsockopt $sock, SOL_SOCKET, SO_KEEPALIVE, 1
    if $sock;
This fixed the problem we reproduced, but made no clear difference on our servers in either staging or production.
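For reference, the reproduction was essentially this (a sketch; Our::SQS, send_message, and the queue URL are stand-ins for our internal wrapper):

my $sqs = Our::SQS->new( queue_url => $queue_url );   # one long-lived client

$sqs->send_message('first');    # works fine
sleep 6 * 60;                   # idle long enough for the NAT to drop the mapping
$sqs->send_message('second');   # hung, then failed with the errors above, pre-fix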
🔗 Instrumentation
The number of exceptions logged was incredible and I was astounded that ElasticSearch didn’t crash. I wrote a little bit of code to translate known exceptions to stats, which are much lighter weight. At the same time I added some other useful stats: duration of request, retry count, etc. The added stats made it clear that we were getting network exceptions and, on top of that, an astounding number of library-level retries; the main caller of the SQS library retries on certain exceptions.
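The translation is nothing clever; it pattern-matches the known errors and bumps a counter instead of shipping a whole exception document (a sketch; the stats client and metric names are stand-ins):

my @known_errors = (
    [ qr/broken pipe/i              => 'sqs.error.broken_pipe'     ],
    [ qr/connection timed out/i     => 'sqs.error.connect_timeout' ],
    [ qr/connection reset by peer/i => 'sqs.error.conn_reset'      ],
);

sub record_exception {
    my ($stats, $err) = @_;

    for my $known (@known_errors) {
        my ($re, $metric) = @$known;
        if ( $err =~ $re ) {
            $stats->increment($metric);   # a counter is far cheaper than a logged exception
            return 1;
        }
    }
    return 0;   # unknown errors still get logged in full
}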
At this point Aaron had read relevant kernel source and was of the opinion that Furl was just doing something wrong in its select(2) loop. We tried one quick thing (checking errors with getsockopt after connect) in case Furl not checking was masking a real problem, but that made no difference. The next easy option was to try another HTTP UserAgent.
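The quick thing looked roughly like this: once select(2) reports the socket ready, ask the kernel whether the connect actually succeeded, since a pending connect error is only visible via SO_ERROR (a sketch, not the actual patch to Furl; $sock is the freshly connected socket):

use Socket qw(SOL_SOCKET SO_ERROR);

my $packed = getsockopt( $sock, SOL_SOCKET, SO_ERROR )
    or die "getsockopt failed: $!";

if ( my $err = unpack 'i', $packed ) {
    local $! = $err;              # turn the errno back into a readable message
    die "connect failed: $!";
}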
The main reason I was willing to make such a drastic change at this point is that the HTTP Keepalive implementation Furl provides is barely sufficient to even work, let alone be called correct. Instead of maintaining a pool, checking timers, etc., it simply has a “pool” of the one last-used connection and reuses it forever, assuming the other side will close it eventually.
🔗 Curl
Years ago I read an HTTP UserAgent benchmark that mje published. The details may no longer be super accurate, but honestly I just wanted to avoid anything notably slower. Using the benchmark I decided to go with Curl via Net::Curl. I was implementing this for SQS, so I only needed to support GETs, which meant building a client compatible with Furl::HTTP would be pretty simple. Here is the (slightly trimmed) code:
package ZR::Curl;

use 5.20.0;
use warnings;
use experimental 'signatures';

use Net::Curl::Easy qw(/^CURLOPT_/ /^CURLINFO_/ /^CURLPROTO_HTTP/ );
use namespace::clean;

use parent 'Net::Curl::Easy';

sub new ($class) {
    my $self = $class->SUPER::new();

    $self->setopt( CURLOPT_USERAGENT,     "ZR::Curl/v0.1" );
    $self->setopt( CURLOPT_PROTOCOLS,     CURLPROTO_HTTP | CURLPROTO_HTTPS );
    $self->setopt( CURLOPT_TCP_KEEPALIVE, 1 );
    $self->setopt( CURLOPT_TIMEOUT,       2 );

    return $self;
}

sub get ($self, $uri, $headers = [] ) {
    my ($body, $head) = ( '', '' );

    $self->setopt( CURLOPT_FILE,       \$body );
    $self->setopt( CURLOPT_HEADERDATA, \$head );
    $self->setopt( CURLOPT_URL,        $uri );
    $self->setopt( CURLOPT_HTTPHEADER, $headers );

    $self->perform;

    # parse the raw header block into the same shape Furl::HTTP returns
    my ($minor, $code, $msg, $ret_headers) =
        ($head =~ m/HTTP\/1\.(.) ([0-9]{3}) (.*?)\r\n(.*)$/s);

    my @headers = map { split /:\s/, $_, 2 } split /\r\n/, $ret_headers;

    return ($minor, $code, $msg, \@headers, $body);
}

1;
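For reference, calling it looks like this, with the same return values Furl::HTTP would give you (the URL and header are just placeholders):

my $ua = ZR::Curl->new;

my ( $minor, $code, $msg, $headers, $body ) = $ua->get(
    'https://sqs.us-east-1.amazonaws.com/?Action=ListQueues',
    [ 'Accept: application/xml' ],
);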
I have more to say about the above, but swapping in this client completely fixed our problems. We reduced our timeout to something more reasonable (it had been 22s), but timeouts are also much less common now, presumably because Curl closes expired sockets instead of trying to use them anyway.
🔗 Things I Like About Curl
Ignoring the better handling of HTTP Keepalive, there are still many things I like about Curl. The main one is that errors are clearly enumerated. In some languages this may be the norm, but in Perl it’s frustratingly rare. With Curl you can (and indeed I did) do some research on errors and plan ahead of time for different failures. Typically in Perl I end up doing this by running the code and seeing what happens.
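For example, Net::Curl throws Net::Curl::Easy::Code errors that compare against the documented CURLE_* constants, so you can decide your retry and alerting policy up front rather than after an incident. A sketch, assuming the ZR::Curl client from above; retry_later and mark_host_down are hypothetical:

use Net::Curl::Easy qw(/^CURLE_/);
use Scalar::Util 'blessed';

my @res = eval { $ua->get( $uri, [] ) };
if ( my $err = $@ ) {
    if ( blessed $err && $err->isa('Net::Curl::Easy::Code') ) {
        # every failure mode is an enumerable constant
        if    ( $err == CURLE_OPERATION_TIMEDOUT ) { retry_later()    }
        elsif ( $err == CURLE_COULDNT_CONNECT    ) { mark_host_down() }
        else                                       { die $err         }
    }
    else {
        die $err;
    }
}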
Fundamentally Curl works by allowing you to set up your request, do it, and examine the results. Most UserAgents expose all kinds of methods which allow you to do various kinds of requests, build URIs, manipulate one or more headers, etc. Curl exposes a little over a dozen methods, a handful of which you’d never use in perl anyway. The interface for me is basically:
- setopt to prepare the request and other features
- perform to actually do the request
Because the model is so simple, the documentation is too. The list of the many, many options is here.
On top of the excellent documentation there are lots of features that do not exist in most UserAgents:
- specify which interface to use
- set a minimum speed limit
- debug nearly all levels of the protocol: HTTP, TLS, TCP
Note that those all have examples. Despite the fact that they are in C they still make it fairly obvious (at least to me) how you can use them. I especially think that the debug function is useful. Nowadays because so much is HTTPS I can’t trivially strace a process to see what headers it is sending, but the debug function gives me just what I need.
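For example, with the client above it takes one option to get the whole conversation, headers included, dumped to stderr, and the other features are just as direct (a sketch; eth1 and the speed numbers are placeholders):

use Net::Curl::Easy qw(/^CURLOPT_/);

my $ua = ZR::Curl->new;

$ua->setopt( CURLOPT_VERBOSE,         1 );      # dump headers, TLS details, etc. to stderr
$ua->setopt( CURLOPT_INTERFACE,       'eth1' ); # pin requests to a specific interface
$ua->setopt( CURLOPT_LOW_SPEED_LIMIT, 1024 );   # give up if we drop below 1KB/s...
$ua->setopt( CURLOPT_LOW_SPEED_TIME,  30 );     # ...for 30 seconds straight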
I doubt that Curl will become the one true UserAgent at ZipRecruiter, but I definitely see swapping it in for Furl, which we use in many places. I suspect Furl is fine for plain HTTP, but once you use HTTPS you will likely want persistent sockets, and Furl barely supports them. One of my coworkers pointed out that the fundamental difference is this:
There are always difficult tradeoffs to make; Curl chose the more user-friendly tradeoff, and Furl chose the simpler-to-implement one. Unfortunately I think requiring all callers to know the various possible TCP errors that the UserAgent can encounter is just too much of a burden.
(The following includes affiliate links.)
I don’t have a lot of great recommendations for further research here. A lot of my ideas depended on having a good logging and statistics setup, which is worth an entire series of posts on its own. I do think that The SRE Book has plenty to say about this stuff and is well worth the read.
Posted Sun, Aug 20, 2017