CloudFront Migration Update

When I migrated my blog to CloudFront I mentioned that I’d post about how it is going in late March. Well it’s late March now so here goes!

First off, I switched from the awscli tools to s3cmd because it does the smart thing and only syncs a file if its MD5 checksum is different. Not only does this make a sync significantly faster, it also reduces PUTs, which are a major part of the cost of this endeavour.
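As a rough sketch of the idea (not s3cmd's actual implementation), the check boils down to hashing each local file and skipping the upload when the remote side already reports the same checksum. The `files_to_upload` helper and the hash of remote checksums are hypothetical names for illustration; how the remote checksums are fetched is left out entirely.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 of a local file's contents.
sub file_md5 {
  my ($path) = @_;
  open my $fh, '<:raw', $path or die "cannot open $path: $!";
  my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
  close $fh;
  return $md5;
}

# Return the local paths that need uploading. $remote_md5 is a hashref
# (hypothetical input) mapping each path to the checksum the remote side
# reports; a missing entry means the file has never been uploaded.
sub files_to_upload {
  my ($local_paths, $remote_md5) = @_;
  return grep {
    my $remote = $remote_md5->{$_};
    !defined $remote || $remote ne file_md5($_);
  } @$local_paths;
}
```

Every file that is filtered out here is one PUT that never happens, which is where the saving comes from.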

Speaking of costs, how much is this costing me? February, which was a partial month, cost a total of $0.03. One might expect March to cost more than four times that amount (still couch change) but because of the s3cmd change I made, the total cost in March so far is $0.04, with a forecast of $0.05. There is one cost that I failed to factor in: logging.

While my full blog is a svelte 36M, just the logs for CloudFront over the past 36 days have been almost double that, and they are compressed with gzip! The logging incurs additional PUTs to S3 as well as an additional storage burden. The free tier includes 5G of free storage, but pulling down the log files as structured (one gzipped file per region per hour) is a big hassle. I had over five thousand log files to download, and it took about an hour. I’m not sure how I’ll deal with it in the future, but I may periodically pull down those logs, consolidate them, and replace them with a single rolled-up file per month.
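A minimal sketch of what that consolidation could look like, assuming each log filename carries its date as `.YYYY-MM-DD-HH.` and that lines starting with `#` are headers worth keeping only once. The `group_by_month` and `roll_up` names are made up for this example, and uploading the result or deleting the originals is not covered.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Group log filenames by month, based on the ".YYYY-MM-DD-HH." part of
# each name. Names that do not match are ignored.
sub group_by_month {
  my @files = @_;
  my %by_month;
  for my $file (@files) {
    next unless $file =~ /\.(\d{4}-\d{2})-\d{2}-\d{2}\./;
    push @{ $by_month{$1} }, $file;
  }
  return \%by_month;
}

# Merge several gzipped logs into one gzipped file. Header lines
# (starting with "#") are kept from the first input only.
sub roll_up {
  my ($out_path, @in_paths) = @_;
  my $out = IO::Compress::Gzip->new($out_path)
    or die "cannot write $out_path: $GzipError";
  my $first = 1;
  for my $in_path (sort @in_paths) {
    my $in = IO::Uncompress::Gunzip->new($in_path)
      or die "cannot read $in_path: $GunzipError";
    while (defined(my $line = $in->getline)) {
      next if !$first && $line =~ /^#/;
      $out->print($line);
    }
    $in->close;
    $first = 0;
  }
  $out->close;
  return;
}
```

Calling `roll_up("2016-03.gz", @{ group_by_month(glob '*.gz')->{'2016-03'} })` would then turn a month of hourly files into one file with a single pair of header lines.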

Because the logs were slightly easier to interact with than before I figured I’d pull them down and take a look. I had to write a little Perl script to parse and merge the logs. Here’s that, for the interested:

#!/usr/bin/env perl

use 5.20.0;
use warnings;

use autodie;

use Text::CSV;

# First argument: a glob matching the log files.
# Remaining arguments: the names of the fields to print.
my $glob = shift;
my @values = @ARGV;
my @filelisting = glob($glob);

for my $filename (@filelisting) {
  # the :gzip layer comes from PerlIO::gzip, which must be installed
  open my $fh, '<:gzip', $filename;
  my $csv = Text::CSV->new({ sep_char => "\t" });
  $csv->column_names([qw(
      date time x_edge_location sc_bytes c_ip method host cs_uri_stem sc_status
      referer user_agent uri_query cookie x_edge_result_type x_edge_request_id
      x_host_header cs_protocol cs_bytes time_taken x_forwarded_for ssl_protocol
      ssl_cipher x_edge_response_result_type
  )]);
  # skip headers
  $csv->getline($fh) for 1..2;
  while (my $row = $csv->getline_hr($fh)) {
    say join "\t", map $row->{$_}, @values
  }
}

To get all of the accessed URLs, with counts, I ran the following one-liner:

perl '*.2016-03-*.gz' cs_uri_stem | sort | uniq -c | sort -n

There are some really odd requests here, along with some sorta frustrating issues. Here are the top thirty, with counts:

  27050 /feed
  24353 /wp-content/uploads/2007/08/transform.png
  13723 /feed/
   8044 /static/img/me200.gif
   5011 /index.xml
   4607 /favicon.ico
   3866 /
   2491 /static/css/styles.css
   2476 /static/css/bootstrap.min.css
   2473 /static/css/fonts.css
   2389 /static/js/bootstrap.min.js
   2384 /static/js/jquery.js
   2373 /robots.txt
    966 /posts/install-and-configure-the-ms-odbc-driver-on-debian/
    637 /wp-content//uploads//2007//08//transform.png
    476 /archives/1352
    311 /wp-content/uploads/2007/08/readingminds2.png
    278 /keybase.txt
    266 /posts/replacing-your-cyanogenmod-kernel-for-fun-and-profit/
    225 /archives/1352/
    197 /feed/atom/
    191 /static/img/pong.p8.png
    166 /posts/concurrency-and-async-in-perl/
    155 /n/a
    149 /posts/weirdest-interview-so-far/
    144 /apple-touch-icon.png
    140 /apple-touch-icon-precomposed.png
    133 /posts/dbi-logging-and-profiling/
    126 /posts/a-gentle-tls-intro-for-perlers/
    120 /feed/atom

What follows is pretty intense navel gazing that I suspect very few people care about. I think it’s interesting but that’s because like most people I am somewhat of a narcissist. Feel free to skip it.

So /feed, /feed/, /feed/atom, and /feed/atom/ are in this list a lot, and sadly when I migrated to CloudFront I failed to set up the redirect header. I’ll be figuring that out soon if possible.
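To see how much traffic those variants add up to, the counts can be merged after normalizing each path. This is only a sketch for tallying purposes: it assumes that paths differing only in repeated or trailing slashes should be counted together, and `normalize_path` and `merge_counts` are names invented for the example. The numbers are copied from the listing above.

```perl
use strict;
use warnings;

# Collapse repeated slashes and drop a trailing slash (but keep "/").
sub normalize_path {
  my ($path) = @_;
  $path =~ s{/+}{/}g;
  $path =~ s{(?<=.)/\z}{};
  return $path;
}

# Sum the counts of all paths that normalize to the same thing.
sub merge_counts {
  my ($counts) = @_;
  my %merged;
  $merged{ normalize_path($_) } += $counts->{$_} for keys %$counts;
  return \%merged;
}

my $merged = merge_counts({
  '/feed'                                         => 27050,
  '/feed/'                                        => 13723,
  '/feed/atom/'                                   => 197,
  '/feed/atom'                                    => 120,
  '/archives/1352'                                => 476,
  '/archives/1352/'                               => 225,
  '/wp-content/uploads/2007/08/transform.png'     => 24353,
  '/wp-content//uploads//2007//08//transform.png' => 637,
});

printf "%7d %s\n", $merged->{$_}, $_
  for sort { $merged->{$b} <=> $merged->{$a} } keys %$merged;
# prints:
#   40773 /feed
#   24990 /wp-content/uploads/2007/08/transform.png
#     701 /archives/1352
#     317 /feed/atom
```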

/, /favicon.ico, and /index.xml are all normal and expected. It really surprises me how many things are accessing / directly. A bunch of it is people, but a lot is feed readers. Why they would hit / is beyond me.

/wp-content/uploads/2007/08/transform.png and /wp-content//uploads//2007//08//transform.png (from this page) seem to be legitimately popular. Bizarrely, the image is being accessed from a huge variety of User Agents. At the advice of a friend I looked more closely and it turns out it’s being hotlinked by a Vietnamese social media site or something. This is cheap enough that I don’t care enough to do anything about it.

/wp-content/uploads/2007/08/readingminds2.png is similar to the above.
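One way to spot this kind of hotlinking is to tally the referring hosts for a single path. The sketch below assumes the rows have already been reduced to `[cs_uri_stem, referer]` pairs and that an empty referer is logged as `-`; `referer_host` and `hosts_for_path` are hypothetical helper names, and the host extraction is a plain regex rather than a full URL parser.

```perl
use strict;
use warnings;

# Pull the host out of a referer value. Empty referers are assumed to be
# logged as "-".
sub referer_host {
  my ($referer) = @_;
  return '(none)' if !defined $referer || $referer eq '' || $referer eq '-';
  return lc $1 if $referer =~ m{\A[a-z][a-z0-9+.-]*://([^/:?#]+)}i;
  return '(unparsed)';
}

# Count referring hosts for one path. Each row is an arrayref of
# [cs_uri_stem, referer].
sub hosts_for_path {
  my ($wanted, @rows) = @_;
  my %hosts;
  for my $row (@rows) {
    my ($path, $referer) = @$row;
    next unless $path eq $wanted;
    $hosts{ referer_host($referer) }++;
  }
  return \%hosts;
}
```

A single host dominating the tally for an image, with few or no matching page views, is a decent hint that the image is embedded somewhere else.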

/static/img/me200.gif is an avatar that I use on a few sites. Not super surprising, but as always: astounded at the number.

/robots.txt is being accessed a lot, presumably by all the various feed readers. It might be worthwhile to actually create that file. No clue.

/static/css/* and /static/js/* should be pretty obvious. I would consider using those from a CDN but my blog is already on a CDN so what’s the point! But it might be worth at least adding some headers so those are cached by browsers more aggressively.

/posts/install-and-configure-the-ms-odbc-driver-on-debian/ is apparently my most popular post, and I would argue that that is legitimate. I should automate some kind of verification that it continues to work. I try to keep it updated but it’s hard now that I’ve stopped using SQL Server myself.

/archives/1352 and /archives/1352/ are the pre-Hugo URLs for the announcement of DBIx::Class::DeploymentHandler. I’m not sure why the old URL is being linked to, but I am glad I put all that effort into ensuring that old links keep working.

/keybase.txt is the identity proof for Keybase (which I have never used, by the way). It must check every four hours or something.

/posts/replacing-your-cyanogenmod-kernel-for-fun-and-profit/ is a weird post of mine, but I’m glad that a lot of people are interested, because it was a lot of work to do.

/static/img/pong.p8.png, /posts/weirdest-interview-so-far/, and /posts/dbi-logging-and-profiling/ were all on / at some point in the month so surely people just clicked those from there.

/posts/concurrency-and-async-in-perl/ and /posts/a-gentle-tls-intro-for-perlers/ are more typical posts of mine, but are apparently pretty popular and I would say for good reason.

/n/a, /apple-touch-icon.png, /apple-touch-icon-precomposed.png all seem like some weird user agent thing, like maybe iOS checks for that if someone makes a bookmark?

World Wide Readership

Ignoring the seriously hotlinked image above, I can easily see where most of my blog is accessed:

perl '*.2016-03-*.gz' cs_uri_stem x_edge_location  | \
  grep -v 'transform' | cut -f 2 | perl -p -e 's/[0-9]+//' | \
  sort | uniq -c | sort -n
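The same tally can be sketched in Perl alone. This version assumes each `x_edge_location` value starts with an airport-style letter code followed by digits (values like `JFK5` in the test data are made-up examples), and it keeps the leading letters instead of deleting the first run of digits, which gives the same result for values of that shape. `edge_code` and `count_by_edge` are hypothetical names.

```perl
use strict;
use warnings;

# Reduce an x_edge_location value to its leading letter code.
# Values without a leading uppercase code are returned unchanged.
sub edge_code {
  my ($location) = @_;
  return $location =~ /\A([A-Z]+)/ ? $1 : $location;
}

# Count requests per edge code.
sub count_by_edge {
  my @locations = @_;
  my %count;
  $count{ edge_code($_) }++ for @locations;
  return \%count;
}
```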

Here’s the top 15 locations which serve my blog:

  21330 JFK # New York
   9668 IAD # Washington D.C.
   8845 ORD # Chicago
   7098 LHR # London
   6536 FRA # Frankfurt
   5319 DFW # Dallas
   4568 ATL # Atlanta
   4328 SEA # Seattle
   3345 SFO # San Francisco
   3137 CDG # Paris
   2991 AMS # Amsterdam
   2966 EWR # Newark
   2339 LAX # Los Angeles
   1993 ARN # Stockholm
   1789 WAW # Warsaw

I’m super pleased at this, because before the migration to CloudFront all of this would be served from a single server in DFW. It was almost surely enough but it’d be slower, especially for the stuff outside of the states.

Aside from the fact that I have not yet set up the redirect for the old feed URLs, I think the migration to CloudFront has gone very well. I’m pleased that I’m less worried about rebooting my Linode and that my blog is served quickly, cheaply, and efficiently to readers worldwide.

Posted Sat, Mar 26, 2016
