Screen Scrape for Love with Web::Scraper
My fiancée and I have not yet picked out a date for our wedding, but we do know that we want it outdoors. We have scoped out a number of locations that can handle indoor and outdoor weddings just in case there is bad weather, but we’d prefer to have perfect weather.
After some searching I found NOAA’s NSSL, which has ridiculous amounts of data. Instead of most websites, which give you the average high temperature and average low temperature for a given day of the year from the past three years, this gives hourly measurements for basically anything back to 1910. Of course some stations are newer and whatnot, but it’s a lot of data.
Their website only lets you get one day of data at a time, so I wrote a screen scraper using the excellent Web::Scraper. Here’s most of it:
#!/usr/bin/env perl
use Modern::Perl;
use JSON;
use URI;
use Web::Scraper;
# http://www.unidata.ucar.edu/cgi-bin/gempak/manual/apxA_index
my %data_to_grab = (
SMPH => 'wind-speed',
TMPF => 'temperature',
RELH => 'humidity',
);
my $data_str = join ';', sort keys %data_to_grab;
my $weather = scraper {
# there isn't a class, so we find the table with width 90
process "table[width=90] tr", "datas[]" => scraper {
process "td:nth-child(2)", 'when' => 'TEXT';
my $i = 2;
for (sort keys %data_to_grab) {
$i++;
process "td:nth-child($i)", $data_to_grab{$_} => 'TEXT';
}
};
};
sub moar_data {
my ($y, $m, $d) = @_;
my $res = $weather->scrape( URI->new(sprintf 'http://data.nssl.noaa.gov/dataselect/nssl_result.php?datatype=sf&sdate=%4i-%02i-%02i&hour=00&sdate2=%4i-%02i-%02i&hour2=23&outputtype=list¶m_val=%s&area=&area=@DFW', $y, $m, $d, $y, $m, $d, $data_str));
warn sprintf "%4i-%02i-%02i\n", $y, $m, $d;
sleep 3 + rand(2);
grep {
# undefined when means there wasn't actualy an observation
defined $_->{when} &&
# ignore headers
$_->{when} ne 'YYMMDD/HHMM'
} @{$res->{datas}}
}
my @end = (
map {
my $year = $_;
(map { moar_data($year, 9, $_) } (1..30)),
(map { moar_data($year, 10, $_) } (1..31)),
} ( 1990..2010 )
);
print to_json(\@end, { pretty => 1 });
The scraper object grabs a bunch of the data from TD’s in the table, skipping the first TD. I made the moar_data function which just takes year, month, day so that I could get more data. It outputs all the data as json, my prefered data format.
If you did the math at home, you realized this is a ridiculous amount of observations; something along the lines of 14 thousand observations. That means you can’t just look at it. So I also wrote a little tool to slice and dice the data. Check it out:
#!/usr/bin/env perl
use Modern::Perl;
use JSON;
use List::Util qw(min max);
use Statistics::Basic qw(mean stddev);
my $field = $ARGV[0];
die "please choose a field to research" unless $field;
die "$field is not a valid field!" unless grep { $_ eq $field }
qw(wind-speed temperature humidity);
# expected format:
# [
# {
# wind-speed => 123,
# temperature => 123,
# when => 'YYMMDD/HHMM',
# },
# ...
# ]
my $data = from_json(do {
local $/ = undef;
open my $fh, 'weather.json';
<$fh>
});
# final format:
# MMDD/HHMM => [{...}],
my %by_day;
for (@$data) {
my $when = $_->{when};
$when =~ s/^\d\d//; # remove the year part
$by_day{$when} = [] unless $by_day{$when};
push @{$by_day{$when}}, $_;
}
say 'datetime ,mean ,stddev,min,max ';
for (sort keys %by_day) {
my @list = map $_->{$field}, grep {
# this is weird, -9999.00 is apparently what they used
# before they had undef?
defined $_->{$field} &&
$_->{$field} != -9999.00
} @{$by_day{$_}};
my $avg = sprintf '%3.2f', mean \@list;
my $min = sprintf '%3.2f', min @list;
my $max = sprintf '%3.2f', max @list;
my $stddev = sprintf '%3.2f', stddev \@list;
say "$_,$avg,$stddev,$min,$max";
}
Anyway, this was a fun project and a nice little valentines day surprise. Hope someone finds it useful :-)
Posted Fri, Feb 18, 2011If you're interested in being notified when new posts are published, you can subscribe here; you'll get an email once a week at the most.