A Foolish Manifesto

fREWdiculous!

Archive for the ‘Ruby’ Category

Web Comic Downloaders

Ever since my serious webcomic journey began with xkcd, about four years ago I think, I’ve been writing little scripts to help me get started on a new comic. The first type of script grabs files with integer-based, monotonically increasing names. Very easy. Done in Ruby.

#!/usr/bin/ruby -w

Format = "http://foobar.com/comics/%08d.gif"
1.upto(986) do |i|
  `wget #{sprintf(Format, i)}`
  sleep 1
end
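
A slightly more defensive variant, as a sketch only: it builds the list of URLs first so you can eyeball them before hitting the server, and it hands each URL to wget as a separate argument instead of going through the shell. The helper names comic_urls and download_all are made up for this example, and foobar.com is still a stand-in.

```ruby
# Sketch only: comic_urls and download_all are hypothetical helper names.
URL_FORMAT = "http://foobar.com/comics/%08d.gif"

# Build every URL up front so the list can be inspected before downloading.
def comic_urls(first, last, url_format = URL_FORMAT)
  (first..last).map { |i| format(url_format, i) }
end

# Fetch each URL with wget, pausing between requests to be polite.
# The list form of system() bypasses the shell entirely.
def download_all(urls, delay = 1)
  urls.each do |url|
    system("wget", url)
    sleep delay
  end
end

puts comic_urls(1, 3)
# download_all(comic_urls(1, 986))  # uncomment to actually download
```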

The next hardest are the ones named after the date of publication. Usually they are published Monday-Wednesday-Friday or on some similar schedule, so you can just step forward one day at a time and check whether it’s the correct weekday. See more Ruby.

#!/usr/bin/ruby -w

Day = 60 * 60 * 24

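Format = "http://www.foobar.com/comics/st%Y%m%d.gif"

t = Time.local(2005, 2, 5)

MWF = [1,3,5]

# Compare with <, not ==: adding 24 hours across a DST change shifts the
# local clock time, so t may never land exactly on midnight of the end date.
while t < Time.local(2007, 7, 9)
  if MWF.include? t.wday
    `wget #{t.strftime(Format)}`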
    sleep 3
  end

  t += Day
end
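
If the daylight-saving wrinkle bothers you, here is a sketch of the same idea using Date instead of Time, so there are no hours to drift. The helper name comic_dates is made up for this example. It only builds the URLs; feeding them to wget works the same as before.

```ruby
require "date"

# Sketch only: comic_dates is a hypothetical helper name.
URL_PATTERN = "http://www.foobar.com/comics/st%Y%m%d.gif"
MWF_DAYS = [1, 3, 5] # Date#wday: 0 is Sunday, so 1, 3, 5 are Mon, Wed, Fri

# All publication dates from first up to (but not including) last.
def comic_dates(first, last, weekdays = MWF_DAYS)
  (first...last).select { |d| weekdays.include?(d.wday) }
end

dates = comic_dates(Date.new(2005, 2, 5), Date.new(2007, 7, 9))
urls = dates.map { |d| d.strftime(URL_PATTERN) }
puts urls.first
```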

And then lastly, and hardest of all, are arbitrarily named files that can only be found by following links. Perl + CPAN to the rescue!!!

#!perl
use strict;
use warnings;
use feature ':5.10';

use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );

sub process_page {
   my @images = $mech->find_all_images(
      url_abs_regex => qr{http://www\.foobar\.com/memberimages/.*\.jpg}i
   );
   foreach (@images) {
      my $url = $_->url;

      if ($url !~ qr/banner/i) {
         say "downloading $url";
         qx{wget $url};
      }
   }
}

$mech->get( 'http://www.foobar.com/foo/bar/series.php?view=single&ID=72709' );
process_page;
while (
   $mech->follow_link(
      # third link on page matching regex
      n             => 3,
      url_abs_regex =>
         qr{http://www\.foobar\.com/foo/bar/series\.php\?view=single&ID=\d+}i
   )
) {
   sleep 1;
   process_page;
}

This last one should be checked on every now and then as it is easy for it to get stuck in an infinite loop on the last couple comics.
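
One way to take the babysitting out of it is to remember which pages you have already visited and bail out as soon as a link leads back to one of them. Here is a sketch of that guard in Ruby. It is not a port of the Perl script above. The helper name follow_chain is made up, the block stands in for "fetch this page and return its next link", and I am assuming the loop comes from the last pages linking back to each other.

```ruby
require "set"

# Sketch only: follow_chain is a hypothetical helper. The block stands in for
# "fetch this page and return the URL of its next link (or nil)".
def follow_chain(start_url, max_pages = 10_000)
  seen = Set.new
  url = start_url
  # Stop on a dead end, on a page we have already seen, or at the page cap.
  while url && !seen.include?(url) && seen.size < max_pages
    seen << url
    url = yield(url)
  end
  seen.to_a
end

# Fake site where the last two comics link to each other.
next_link = { "c1" => "c2", "c2" => "c3", "c3" => "c2" }
visited = follow_chain("c1") { |url| next_link[url] }
p visited
```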

Anyway, enjoy! This set of scripts should take care of all of your webcomic scraping needs :-)

Note: these are not to avoid ads, but to speed up the initial reading process as speed is an issue when reading 400 or more strips.

  • 1 Comment
  • Filed under: Ruby, perl
  • Migrating from IIS to Apache

    At my job we use a combination of IIS, SQL Server, and Perl. In general it works pretty well. But there is one major problem: if we ever do a warn in perl, instead of printing the message to the log, it crashes the server. That’s a big deal since multiple people are using the server and fixing the issue means VNCing in and recycling the app pool. This doesn’t always happen, but it happens a lot; enough to make me consider setting up Apache on my personal computer so that I can get some serious logging. Anyway, I don’t know if we have a typical setup or not, but this is what I had to do to get it all going.

    1. Install ActiveState perl into C:/usr (not C:/Perl.) You can get the latest version here. That’s more or less it for installing perl. Note: Latest at the time of writing is Perl 5.10.0.1004
    2. Install Apache. You can get the latest version here. I suggest getting the binary msi. Get the one with OpenSSL if you want to set up https (not covered here.) Note: Latest at the time of writing is Apache 2.2.10 (OpenSSL 0.9.8i).
    3. Install Perl Modules. I am not sure which modules our software requires beyond what comes with perl out of the box, but I know for sure that we need DateTime. So to install that, open a console and type ppm install DateTime. You can use the GUI instead if you’d like, but it tends to just get in my way because it’s so slow. You will know that you are missing a module if you get an error in the log like this:
      [Tue Nov 18 17:33:05 2008] [error] [client 127.0.0.1] Premature end of script headers: foo.plx, referer: http://127.0.0.1/
      [Tue Nov 18 17:33:05 2008] [error] [client 127.0.0.1] Can't locate DateTime.pm in @INC (@INC contains: C:/usr/site/lib C:/usr/lib .) at foo.plx line 39., referer: http://127.0.0.1/
      [Tue Nov 18 17:33:05 2008] [error] [client 127.0.0.1] BEGIN failed--compilation aborted at foo.plx line 39., referer: http://127.0.0.1/
    4. Migrate source code. For me this just meant checking out one folder from subversion into C:/Inetpub (recommended so that things will continue to work with both IIS and Apache) and copying a directory of static html into the same directory. I ended up with two main directories like this: C:/Inetpub/main and C:/Inetpub/static. (Names changed to protect the innocent :-) )
    5. Configure Apache. This, besides the last step, is probably the hardest step. First open httpd.conf (probably C:/Program Files/Apache Software Foundation/Apache2.2/conf/httpd.conf… but you can find it in the start menu). On IIS our static directory is the root and then the main directory is a subdirectory of the static directory. To set this up, first find the
      DocumentRoot "..."

      directive in the httpd.conf and change it to

      DocumentRoot "C:/Inetpub/static"

      Next you’ll want to make sure that your Directories are configured. Find the existing Directory section and just change the directory to whatever you just did, and then add another one for each other directory in Inetpub. This is what I ended up with:

      <Directory "C:/Inetpub/static">
          Options Indexes FollowSymLinks ExecCGI
          AllowOverride None
          Order allow,deny
          Allow from all
      </Directory>
      <Directory "C:/Inetpub/main">
          Options Indexes FollowSymLinks ExecCGI
          AllowOverride None
          Order allow,deny
          Allow from all
      </Directory>

      And then because our static directory was root and the main directory was a subdirectory of the static directory, add a line like the following to the alias_module section:

      Alias /user /Inetpub/main

      Also, since we are a perl shop, we have to allow execution of various types of perl programs, so find the mime_module section of the code and make the AddHandler part look like this:

      AddHandler cgi-script .cgi .plx .plex

      And then last of all, the main page of our root directory on IIS is Default.html, so instead of renaming it to index.html, find the section of the code for the Directory and add a DirectoryIndex part so it is like this:

      DirectoryIndex Default.html

      I ended up setting it for both main and static. Here’s my entire httpd.conf if you just wanna see the final product:

      ServerRoot "C:/Program Files/Apache Software Foundation/Apache2.2"
      Listen 80
      LoadModule actions_module modules/mod_actions.so
      LoadModule alias_module modules/mod_alias.so
      LoadModule asis_module modules/mod_asis.so
      LoadModule auth_basic_module modules/mod_auth_basic.so
      LoadModule authn_default_module modules/mod_authn_default.so
      LoadModule authn_file_module modules/mod_authn_file.so
      LoadModule authz_default_module modules/mod_authz_default.so
      LoadModule authz_groupfile_module modules/mod_authz_groupfile.so
      LoadModule authz_host_module modules/mod_authz_host.so
      LoadModule authz_user_module modules/mod_authz_user.so
      LoadModule autoindex_module modules/mod_autoindex.so
      LoadModule cgi_module modules/mod_cgi.so
      LoadModule dir_module modules/mod_dir.so
      LoadModule env_module modules/mod_env.so
      LoadModule include_module modules/mod_include.so
      LoadModule isapi_module modules/mod_isapi.so
      LoadModule log_config_module modules/mod_log_config.so
      LoadModule mime_module modules/mod_mime.so
      LoadModule negotiation_module modules/mod_negotiation.so
      LoadModule setenvif_module modules/mod_setenvif.so
      <IfModule !mpm_netware_module>
      <IfModule !mpm_winnt_module>
      User daemon
      Group daemon
      </IfModule>
      </IfModule>
      ServerAdmin frewmbot@gmail.com
      DocumentRoot "C:/Inetpub/static"
      <Directory />
          Options FollowSymLinks
          AllowOverride None
          Order deny,allow
          Deny from all
      </Directory>
      <Directory "C:/Inetpub/static">
          Options Indexes FollowSymLinks ExecCGI
          AllowOverride None
          Order allow,deny
          Allow from all
          DirectoryIndex Default.html
      </Directory>
      <Directory "C:/Inetpub/main">
          Options Indexes FollowSymLinks ExecCGI
          AllowOverride None
          Order allow,deny
          Allow from all
          DirectoryIndex main.plx
      </Directory>
      <IfModule dir_module>
          DirectoryIndex index.html
      </IfModule>
      <FilesMatch "^\.ht">
          Order allow,deny
          Deny from all
          Satisfy All
      </FilesMatch>
      ErrorLog "logs/error.log"
      LogLevel warn
      <IfModule log_config_module>
          LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
          LogFormat "%h %l %u %t \"%r\" %>s %b" common
          <IfModule logio_module>
            LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio
          </IfModule>
          CustomLog "logs/access.log" common
      </IfModule>
      <IfModule alias_module>
          Alias /user /Inetpub/main
          ScriptAlias /cgi-bin/ "C:/Program Files/Apache Software Foundation/Apache2.2/cgi-bin/"
      </IfModule>
      <Directory "C:/Program Files/Apache Software Foundation/Apache2.2/cgi-bin">
          AllowOverride None
          Options None
          Order allow,deny
          Allow from all
      </Directory>
      DefaultType text/plain
      <IfModule mime_module>
          TypesConfig conf/mime.types
          AddType application/x-compress .Z
          AddType application/x-gzip .gz .tgz
          AddHandler cgi-script .cgi .plx .plex
      </IfModule>
    6. Ensure all of your perl files start with #!/usr/bin/perl. If you don’t do this, Apache will give you a 500 Internal Server Error in the browser, and then something like this in the log:
      [Tue Dec 09 20:59:04 2008] [error] [client 127.0.0.1] (OS 3)The system cannot find the path specified.  : couldn't create child process: 720003: employee_training_report.plx
      [Tue Dec 09 20:59:04 2008] [error] [client 127.0.0.1] (OS 3)The system cannot find the path specified.  : couldn't spawn child process: C:/Inetpub/epms/customer/employee_training_report.plx

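      As stated, this just means that the bangline (shebang) is wrong and needs to be set to #!/usr/bin/perl. Apache on Windows reads the interpreter path from that line, which is why step 1 installs perl into C:/usr: the path then resolves to C:/usr/bin/perl.exe. Note: the log is probably in C:/Program Files/Apache Software Foundation/Apache2.2/logs/error.log, but again, you can find that in the start menu.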

    7. Fix all headers. Usually with IIS we output headers like this:
      print "HTTP/1.0 200 OK\n";
      print header;

      header is a function of the CGI module. For IIS you print out the first part to force the server into NPH mode. I recommend, for ease of migration, making your own module that your scripts can use that will print the header correctly whether it’s Apache or IIS. Here’s ours:

      sub header {
          return (($ENV{PERLXS})?"HTTP/1.0 200 OK\r\n":"").CGI->header(@_);
      }

      Note that our header function passes any arguments it receives straight through to CGI’s header. That allows for a simple regular-expression search and replace to switch existing code over to the new function. (For vim something like this will work: :%s/\v^(\s*print\s+).*header(\(.*\));/\1Module::header\2;/g )

      I fix these as I run into them, since my boss didn’t want me to search and replace the whole codebase. The symptom to look for is a 500 in the browser and then something like this in the log:

      [Tue Nov 18 17:38:38 2008] [error] [client 127.0.0.1] malformed header from script. Bad header=HTTP/1.0 200 OK: foo.plx
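
Since all three failure modes above show up in the same error.log, a small script can sort them out for you. This is only a sketch, in Ruby because that’s what I had handy: classify_error and the hint wording are made up for this example, and the sample lines are the ones quoted in the steps above.

```ruby
# Sketch only: classify_error and the hint texts are hypothetical.
# Maps one Apache error.log line to the migration step that fixes it.
def classify_error(line)
  case line
  when /Can't locate (\S+)\.pm in @INC/
    "step 3: missing Perl module #{$1.gsub('/', '::')}"
  when /couldn't (?:create|spawn) child process/
    "step 6: fix the bangline (#!/usr/bin/perl)"
  when /malformed header from script/
    "step 7: script prints its own HTTP status line"
  end
end

SAMPLE_LOG = <<~'LOG'
  [Tue Nov 18 17:33:05 2008] [error] [client 127.0.0.1] Can't locate DateTime.pm in @INC (@INC contains: C:/usr/site/lib C:/usr/lib .) at foo.plx line 39., referer: http://127.0.0.1/
  [Tue Dec 09 20:59:04 2008] [error] [client 127.0.0.1] (OS 3)The system cannot find the path specified.  : couldn't create child process: 720003: employee_training_report.plx
  [Tue Nov 18 17:38:38 2008] [error] [client 127.0.0.1] malformed header from script. Bad header=HTTP/1.0 200 OK: foo.plx
LOG

SAMPLE_LOG.each_line do |line|
  hint = classify_error(line)
  puts hint if hint
end
```

To run it against the real log, swap SAMPLE_LOG.each_line for File.foreach with the path to your error.log.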

    And that’s basically it! Any tips you might have to add are welcome!

  • 0 Comments
  • Filed under: Ruby, Uncategorized
  • Ruby 1.9 is out!

    Exciting! It was apparently put up yesterday, on Christmas. What a cool gift, right? I looked through the changes maintained by Mauricio and here are /my/ favorites.

    *New literal hash syntax [Ruby2]*

    {a: "foo"}      # => {:a=>"foo"}

    *.() and calling Procs without #call/#[] [EXPERIMENTAL]*

    You can now do:

    a = lambda{|*b| b}
    a.(1,2) # => [1, 2]

    *Multiple splats allowed*

    1.9 allows multiple splat operators when calling a method:

       def foo(*a)
         a
       end

       foo(1, *[2,3], 4, *[5,6])                        # => [1, 2, 3, 4, 5, 6]

    *Mandatory arguments after optional arguments allowed*

       def m(a, b=nil, *c, d)
         [a,b,c,d]
       end
       m(1,2)                                         # => [1, nil, [], 2]

    *Object#tap*

    Passes the object to the block and returns it (meant to be used for call chaining).

    "F".tap{|x| x.upcase!}[0] # => "F"
    # Note that "F".upcase![0] would fail since upcase! would return nil in this
    # case.

    *Module#attr is an alias of attr_reader*

    Use

    attr :foo=

    to create a read/write accessor. (RCR#331)

    *Enumerable#cycle*

    Calls the given block for each element of the enumerable in a never-ending cycle:

    a = ["a", "b", "c"]
    a.cycle {|x| puts x }  # print, a, b, c, a, b, c,.. forever.

    *Enumerable#group_by*

    Groups the values in the enumerable according to the value returned by the block:

    (1..10).group_by{|x| x % 3} # => {0=>[3, 6, 9], 1=>[1, 4, 7, 10], 2=>[2, 5, 8]}

    *Enumerable#drop*

    Without a block, returns an array with all but the first n elements from the enumeration. Otherwise drops elements while the block returns true (and returns all the elements after it returns a false value):

    a = [1, 2, 3, 4, 5]
    a.drop(3) # => [4, 5]
    a.drop {|i| i < 3 } # => [3, 4, 5]
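
A caveat if you try this on a current Ruby: there, drop always takes a count and the block form is spelled drop_while. A quick sketch of the same two calls in that spelling:

```ruby
a = [1, 2, 3, 4, 5]

# Count form: skip the first three elements.
p a.drop(3)                  # => [4, 5]

# Block form: skip elements while the block returns true.
p a.drop_while { |i| i < 3 } # => [3, 4, 5]
```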

    *Enumerable#inject (#reduce) without a block*

    If no block is given, the first argument to #inject is the name of a two-argument method that will be called; the optional second argument is the initial value:

    [RUBY_VERSION, RUBY_RELEASE_DATE] # => ["1.9.0", "2007-08-03"]
    (1..10).reduce(:+) # => 55

    *Enumerable#count*

    It could be defined in Ruby as

    def count(*a)
      inject(0) do |c, e|
        if a.size == 1 # suspect, but this is how it works
          (a[0] == e) ? c + 1 : c
        else
          yield(e) ? c + 1 : c
        end
      end
    end

    Therefore

    ["bar", 1, "foo", 2].count(1) # => 1
    ["bar", 1, "foo", 2].count{|x| x.to_i != 0} # => 2

    *Array#nitems*

    It is equivalent to selecting the elements that satisfy a condition and obtaining the size of the resulting array:

    %w[1 2 3 4 5 6].nitems{|x| x.to_i > 3}      # => 3

    *Block argument to Array#index, Array#rindex [Ruby2]*

    They can now take a block to make them work like #select.

    ['a','b','c'].index{|e| e == 'b'} # => 1
    ['a','b','c'].index{|e| e == 'c'} # => 2
    ['a','a','a'].rindex{|e| e == 'a'} # => 2
    ['a','a','a'].index{|e| e == 'b'} # => nil

    *Array#combination*

    ary.combination(n){|c| ...}

    yields all the combinations of length n of the elements in the array to the given block. If no block is passed, it returns an enumerator instead. The order of the combinations is unspecified.

    a = [1, 2, 3, 4]
    a.combination(1).to_a #=> [[1],[2],[3],[4]]
    a.combination(2).to_a #=> [[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
    a.combination(3).to_a #=> [[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
    a.combination(4).to_a #=> [[1,2,3,4]]
    a.combination(0).to_a #=> [[]]: one combination of length 0
    a.combination(5).to_a #=> [] : no combinations of length 5

    *Array#permutation*

    Operates like #combination, but with permutations of length n.

    a = [1, 2, 3]
    a.permutation(1).to_a #=> [[1],[2],[3]]
    a.permutation(2).to_a #=> [[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]]
    a.permutation(3).to_a #=> [[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
    a.permutation(0).to_a #=> [[]]: one permutation of length 0
    a.permutation(4).to_a #=> [] : no permutations of length 4

    *Array#pop, Array#shift*

    They can take an argument to specify how many objects to return:

    %w[a b c d].pop(2) # => ["c", "d"]

    *Hash preserves order!*

    RUBY_VERSION                    # => "1.9.0"
    h={:a=>1, :b=>2, :c=>3, :d=>4}  # => {:a=>1, :b=>2, :c=>3, :d=>4}
    h[:e]=5
    h                                           # => {:a=>1, :b=>2, :c=>3, :d=>4, :e=>5}

    h.keys                                      # => [:a, :b, :c, :d, :e]
    h.values                                    # => [1, 2, 3, 4, 5]
    h.to_a                                      # => [[:a, 1], [:b, 2], [:c, 3], [:d, 4], [:e, 5]]

    vs.

    RUBY_VERSION                    # => "1.8.6"
    h={:a=>1, :b=>2, :c=>3, :d=>4}  # => {:a=>1, :b=>2, :c=>3, :d=>4}
    h[:e]=5
    h                                           # => {:e=>5, :a=>1, :b=>2, :c=>3, :d=>4}
    h.keys                                      # => [:e, :a, :b, :c, :d]
    h.values                                    # => [5, 1, 2, 3, 4]
    h.to_a                                      # => [[:e, 5], [:a, 1], [:b, 2], [:c, 3], [:d, 4]]

    *Numeric#upto, #downto, #times, #step*

    These methods return an enumerator if no block is given:

    a = 10.times
    a.inject{|s,x| s+x } # => 45
    a = []
    b = 10.downto(5)
    b.each{|x| a << x}
    a # => [10, 9, 8, 7, 6, 5]

    *Range#cover?*

    range.cover?(value)

    compares value to the begin and end values of the range, returning true if it lies between them, honoring #exclude_end?.

    ("a".."z").cover?("c")                            # => true
    ("a".."z").cover?("5")                            # => false
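
Where this differs from #include? is easiest to see with a value that sorts inside the range but is never produced by stepping through it. A small sketch (the variable name letters is just for this example):

```ruby
letters = ("a".."z")

# cover? only compares against the endpoints: "a" <= "cc" <= "z" as strings.
p letters.cover?("cc")   # => true

# include? asks whether the value is one of the range's elements.
p letters.include?("cc") # => false

# exclude_end? is honored: 10 is not covered by 1...10.
p (1...10).cover?(10)    # => false
p (1..10).cover?(10)     # => true
```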

    *Limit input in IO#gets, IO#readline, IO#readlines, IO#each_line, IO#lines, IO.foreach, IO.readlines, StringIO#gets, StringIO#readline, StringIO#each, StringIO#readlines*

    These methods accept an optional integer argument to specify the maximum amount of data to be read. The limit is specified either as the (optional) second argument, or by passing a single integer argument (i.e. the first argument is interpreted as the limit if it’s an integer, as a line separator otherwise).
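
A quick sketch of both call styles, using StringIO so there is no file to set up (the sample text is made up):

```ruby
require "stringio"

io = StringIO.new("first line\nsecond line\n")

p io.gets(5)       # => "first"  (single integer argument is the limit)
p io.gets("\n", 3) # => " li"    (separator first, limit second)
p io.gets          # => "ne\n"   (no limit: rest of the current line)
```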

    *IO#ungetc, StringIO#ungetc*

    Allows you to push back an arbitrarily large character.

    *Seven predicate methods were added for the weekdays:*

    Time.now        # => Thu Nov 03 18:58:25 CET 2005
    Time.now.sunday?        # => false
  • 2 Comments
  • Filed under: Ruby
  • Friday Tips and Tricks

    Time saving tips and tricks!

    This first tip is something that I use almost daily. Do you ever want to change a filename to something that is similar to the original name? For instance, maybe you just want to change/add/remove the extension? Well, if you are using a reasonable shell you can do the following:

    # Add .txt to the filename
    cp textfiel{,.txt}
    # change el to le
    cp textfi{el,le}.txt
    # remove extension
    cp textfile{.txt,}
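
If it isn’t obvious what the shell hands to cp in each case, here is a toy version of brace expansion in Ruby. It is only a sketch: brace_expand is a made-up helper that handles a single, non-nested group, which is all the commands above need. Real shells do a lot more.

```ruby
# Sketch only: brace_expand is a hypothetical helper that handles one
# non-nested {a,b,...} group containing at least one comma.
def brace_expand(word)
  match = word.match(/\A(.*?)\{([^{}]*,[^{}]*)\}(.*)\z/)
  return [word] unless match
  prefix, body, suffix = match.captures
  # The -1 limit keeps empty alternatives such as the one in "{.txt,}".
  body.split(",", -1).map { |alt| prefix + alt + suffix }
end

p brace_expand("textfiel{,.txt}")   # => ["textfiel", "textfiel.txt"]
p brace_expand("textfi{el,le}.txt") # => ["textfiel.txt", "textfile.txt"]
p brace_expand("textfile{.txt,}")   # => ["textfile.txt", "textfile"]
```

So each command is just cp with the two expanded names as its arguments.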

    Or how about this; fairly often I will be programming and I will be adding a predefined string to the end of another string a bunch of times, except for the last time. The idea is to put the predefined string between some other things. This is pretty regular if you are generating HTML or SQL. Well, instead of doing the following:

    output = ""
    some_array.each_with_index do |item,index|
      output += item
      output += " AND " unless index == some_array.length - 1
    end

    you can do:

    output = some_array.join(" AND ")

    Another thing that I find myself doing often is the following:

    output = ""
    some_array.each_with_index do |item,index|
      output += "?"
      output += "," unless index == some_array.length - 1
    end

    That will generate the question marks for an SQL statement. Again, that’s a little messy and there is a cleaner way to do it.

    output = some_array.map{"?"}.join(",")
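
Put together, the two tips build a whole query string. This is just a sketch: the comics table and its columns are made up for the example, and the values themselves would still be passed separately to whatever database library you use.

```ruby
# Sketch only: table and column names are hypothetical.
ids = [3, 8, 15]
conditions = ["published = ?", "author = ?"]

# One "?" per value, comma separated.
placeholders = (["?"] * ids.length).join(",")

# Same trick as before for gluing the conditions together.
where = (conditions + ["id IN (#{placeholders})"]).join(" AND ")

sql = "SELECT * FROM comics WHERE #{where}"
puts sql
```

One thing to watch for: an empty array gives you IN (), which most databases reject, so check for that case first.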

    Much better! It’s much shorter and should be easier to understand for other Ruby programmers.

    It’s good to put things like this into practice, because it will make your code more readable and easier to maintain. Generally, in my manifesto, fewer lines of code (comments and whitespace don’t count) are better. Of course, in a language like Ruby this can create performance problems; it’s a balance between what works for you as the programmer and what works for the user. If the speed is really an issue, change the code. Otherwise, save your skull!

    If you have any tips for regular things like this, let me know. I need to know stuff like this just as much as anyone else.

  • 0 Comments
  • Filed under: Ruby, Shell, Unix