fREWdiculous!
1 Jun
Ever since my serious webcomic journey began with xkcd, about four years ago I think, I’ve been writing little scripts to help me get started on a comic’s archive. The first type of script grabs files whose names are integers that increase by one with each strip. Very easy. Done in Ruby.
```ruby
#!/usr/bin/ruby -w

# %08d zero-pads the comic number to eight digits
Format = "http://foobar.com/comics/%08d.gif"

1.upto(986) do |i|
  `wget #{sprintf(Format, i)}`
  sleep 1  # be gentle with the server
end
```
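To check what the `%08d` in the format string produces before pointing a loop at a server, a small sketch can build the URLs without downloading anything. The helper name `comic_urls` is made up for illustration, and foobar.com is the same placeholder host as in the script.

```ruby
# Hypothetical helper: builds the list of URLs without fetching anything.
# pattern is a Kernel#format string; %08d zero-pads to eight digits.
def comic_urls(pattern, first, last)
  (first..last).map { |i| format(pattern, i) }
end

urls = comic_urls("http://foobar.com/comics/%08d.gif", 1, 3)
puts urls
# http://foobar.com/comics/00000001.gif
# http://foobar.com/comics/00000002.gif
# http://foobar.com/comics/00000003.gif
```

Printing the first few URLs and opening one in a browser is a cheap way to confirm the padding width before running the real download.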
Next in difficulty are the ones named after their date of publication. Usually, though, they are published Monday-Wednesday-Friday or something like that, so you can just step forward one day at a time and download only when it’s the correct weekday. See more Ruby.
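```ruby
#!/usr/bin/ruby -w

Day = 60 * 60 * 24  # seconds in one day
Format = "http://www.foobar.com/comics/st%Y%m%d.gif"

# UTC avoids daylight-saving shifts, which would otherwise move t off
# midnight and keep the end test below from ever matching.
t = Time.utc(2005, 2, 5)

MWF = [1, 3, 5]  # Time#wday values for Monday, Wednesday, Friday

until t == Time.utc(2007, 7, 9)
  if MWF.include? t.wday
    `wget #{t.strftime(Format)}`
    sleep 3
  end
  t += Day
end
```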
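The same step-a-day-and-check-the-weekday idea can also be sketched with Ruby's `Date` class, which counts in whole days and so sidesteps seconds arithmetic entirely. This is only an illustrative variant: `dated_urls` is a made-up helper name, and it builds the URLs without downloading them.

```ruby
require "date"

# Hypothetical helper: URLs for every strip published on the given weekdays,
# from first_day up to but not including last_day.
def dated_urls(pattern, first_day, last_day, weekdays)
  (first_day...last_day)
    .select { |d| weekdays.include?(d.wday) }
    .map { |d| d.strftime(pattern) }
end

urls = dated_urls("http://www.foobar.com/comics/st%Y%m%d.gif",
                  Date.new(2005, 2, 5), Date.new(2005, 2, 12), [1, 3, 5])
puts urls
# http://www.foobar.com/comics/st20050207.gif
# http://www.foobar.com/comics/st20050209.gif
# http://www.foobar.com/comics/st20050211.gif
```

Each URL could then be handed to wget with a pause in between, exactly as in the loop version.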
And then lastly, and hardest of all, are arbitrarily named files whose URLs can only be discovered by following links. Perl + CPAN to the rescue!!!
```perl
#!perl
use strict;
use warnings;
use feature ':5.10';

use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Download every comic image on the current page, skipping banners.
sub process_page {
   my @images = $mech->find_all_images(
      url_abs_regex => qr{http://www\.foobar\.com/memberimages/.*\.jpg}i
   );
   foreach (@images) {
      my $url = $_->url_abs;   # absolute URL, even if the page uses relative ones
      if ($url !~ qr/banner/i) {
         say "downloading $url";
         # list-form system skips the shell, so characters like & are safe
         system 'wget', "$url";
      }
   }
}

$mech->get(
   'http://www.foobar.com/foo/bar/series.php?view=single&ID=72709'
);
process_page;

while (
   $mech->follow_link(
      # third link on page matching regex
      n => 3,
      url_abs_regex => qr{http://www\.foobar\.com/foo/bar/series\.php\?view=single&ID=\d+}i
   )
) {
   sleep 1;
   process_page;
}
```
This last one should be checked on every now and then, as it can easily get stuck in an infinite loop on the last couple of comics.
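One way to guard against that kind of loop is to remember which pages have already been visited and stop at the first repeat. The following is only a sketch, written in Ruby rather than Perl: `crawl` and the `next_page` lambda are hypothetical stand-ins for the follow-the-link step, and a hash of fake pages replaces real HTTP requests.

```ruby
require "set"

# Visits pages starting at start_url, yielding each one exactly once.
# next_page is any callable returning the next URL, or nil at the end.
# Returns the number of distinct pages visited.
def crawl(start_url, next_page)
  seen = Set.new
  url = start_url
  while url && seen.add?(url)   # add? returns nil if url was already seen
    yield url
    url = next_page.call(url)
  end
  seen.size
end

# Fake archive: the last page links back to an earlier one, which is
# the situation that would loop forever without the guard.
links = { "p1" => "p2", "p2" => "p3", "p3" => "p2" }
visited = []
count = crawl("p1", ->(u) { links[u] }) { |u| visited << u }
puts visited.inspect  # ["p1", "p2", "p3"]
```

The same idea would carry over to the Perl script by keeping a hash of visited URLs and leaving the `while` loop when the current page is already in it.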
Anyway, enjoy! This set of scripts should take care of all of your webcomic scraping needs.
Note: these are not to avoid ads, but to speed up the initial reading process as speed is an issue when reading 400 or more strips.
One Response for "Web Comic Downloaders"
Nice coincidence that you should post this right now, as I have a terminal window currently scrolling with status messages from my own download of a webcomic archive so I can catch up.
Fortunately the one I’m getting now is of the first variety, with a nicely sequential number system. I was able to get this down to a single line in bash:
for i in `seq 1402`; do echo "http://comic.com/images/$i.png" >> list; done && wget -P images/ -i list -o log -w 8 --random-wait -U "$UserAgent"
The first part creates a file that lists the links to all of the images (note: those are backticks, not quotation marks, around the sequence command — of course adjust the number there to whatever the most recent # comic is).
The wget parameters are:
-P a prefix to add to the files to save to, in this case an ‘images’ directory
-i imports a list of urls from my “list” file
-o directs all output to log file “log”, so I can scan through for any errors later
-w base the pause time between requests on 8 seconds
--random-wait randomize the pause to between 0.5w and 1.5w (4 and 12 seconds)
-U set the user-agent, in my case this is a variable that gets loaded with bash for just such scripts — this sets how your request looks in their logs
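The same randomized pause could be borrowed for the Ruby scripts in the post, in place of a fixed `sleep`. A minimal sketch, assuming the 0.5w to 1.5w range described for --random-wait; `polite_delay` is a made-up name, not part of wget or Ruby.

```ruby
# Hypothetical helper: a pause between 0.5 and 1.5 times the base wait,
# mirroring the range described for wget's --random-wait.
def polite_delay(base, rng = Random.new)
  base * rng.rand(0.5..1.5)
end

delay = polite_delay(8)   # somewhere between 4 and 12 seconds
puts format("sleeping %.1f seconds", delay)
# sleep delay             # uncomment inside a real download loop
```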
I’ve written scripts like this in pure bash as well as in Python. I love scripting this stuff!