Perl Advent Calendar 2009-12-06

Santy Claus, why? Why are you taking our Perl module? WHY?

by Bill 'N1VUX' Ricker

The xml_grep utility and XML-Twig 101 are good intros to the XML::Twig suite, but that does not make for a very interesting Calendar entry.

As you may have noticed, this year's Advent Calendar layout does not obscure the picture with the calendar squares (window shutters), but it also does not have a full 25 windows. Since we have missed some days in the last few years, this should not be a problem, right? Well, the X-Y coordinates and dates are hard-coded, so skipping a day requires some fussy work. This clearly calls for a Perl script that parses the HTML file and increments the day of every unused entry (and only those), to be run whenever we take a skip day. In case the authors and editors are very naughty or lazy and skip more than 3 days (and thus will be getting coal in their stockings), the script can even remove the last box(es), deleting any day that would be incremented beyond the 25th.

perl -i.bak mod6.pl index.html
mv index.html.bak index.before.html
diff -u index.before.html index.html | tee result-diff.txt
...
-<br><div class="q"><a href="4/" style="left:425px; top:  5px">4</a></div>
-<br><div class="q"><a href="" style="left:545px; top:  5px">5</a></div>
-<br><div class="q"><a href="" style="left:665px; top:  5px">6</a></div>
...
-<br><div class="C"><a href="" style="left:  5px; top: 52px">22</a></div>
...
+<br /><div class="q"><a href="4/" style="left:425px; top:  5px">4</a></div>
+<br /><div class="q"><a href="" style="left:545px; top:  5px">6</a></div>
+<br /><div class="q"><a href="" style="left:665px; top:  5px">7</a></div>
...
+<br /><div class="C"><a href="" style="left:  5px; top: 52px">23</a></div>

The result is fairly subtle, and the diff above is somewhat opaque. If you reached this page by opening the calendar door (if not, you missed half the traditional seasonal fun), you may have noticed that 5 is missing. Otherwise, you may need to compare closely with the previous state of the page to see that the boxes from 5..22 were incremented.
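The renumbering rule can be tried out on plain data before any HTML is involved. The following is only a sketch: the @boxes list and the skip_day function are hypothetical stand-ins invented for illustration, with an empty href marking a day that has not been published yet.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the calendar boxes (not the real index.html):
# an empty href means the day is still unused.
my @boxes = (
    { href => '4/', day => 4  },
    { href => '',   day => 5  },
    { href => '',   day => 6  },
    { href => '',   day => 24 },
    { href => '',   day => 25 },
);

# Skip one day: bump every unused box, then drop any box
# that would land past the 25th. The input list is left untouched.
sub skip_day {
    my @in = @_;
    my @out;
    for my $box (@in) {
        my %copy = %$box;
        $copy{day}++ if $copy{href} eq '';
        push @out, \%copy if $copy{day} <= 25;
    }
    return @out;
}

my @after = skip_day(@boxes);
print join( ' ', map { $_->{day} } @after ), "\n";    # prints "4 6 7 25"
```

The published box (4) keeps its number, the unused ones move up by one, and the box that would have become 26 disappears.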

(Yes Virginia, there will be a Christmas door for the 25th.)

Because it treats the input as XML, Twig does not normally preserve the original whitespace, but the keep_spaces => 1 option helps us here.

XML-Twig is allergic to some HTML that is NOT well-formed XML, so until we upgrade to XHTML, the script needs to insert the end-slash into empty tags as needed and convert the HTML-only entities. [1]
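To see the repair step in isolation, here is a small sketch that applies the same kind of substitution to a sample string using only core Perl. The helper name close_empty_tags and the sample markup are made up for illustration; the tag list (link, br, img) and the &middot; entity are the ones mod6.pl handles.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper: make HTML empty tags and entities XML-friendly.
sub close_empty_tags {
    my ($html) = @_;
    for my $tag (qw(link br img)) {
        # Match "<tag ...>" whose last character before ">" is not "/".
        # \K keeps everything matched so far, so only ">" is replaced.
        $html =~ s{< $tag (?: \b [^>]*? [^/] )? \K >}{ />}gxi;
    }
    $html =~ s{&middot;}{&#183;}g;    # HTML-only entity to numeric reference
    return $html;
}

my $sample = qq{<br><img src="x.png"><br /><b>bold</b> &middot;};
say close_empty_tags($sample);
# prints: <br /><img src="x.png" /><br /><b>bold</b> &#183;
```

Tags that are already self-closed are left alone, as are non-empty tags such as <b> that merely share a first letter with one of the listed names.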

mod6.pl

    use strict;
    use warnings;
    use 5.010;    # for \K in the regex below
    use XML::Twig;

    my $t = XML::Twig->new(
        pretty_print => 'indented',    # output nicely formatted
        keep_spaces  => 1,             # keep whitespace as in the original layout
        empty_tags   => 'html',        # outputs <empty_tag />
    );

    my $contents = do { local $/; <> };    # slurp the whole input

    # repair html to well-formed xml: close empty tags, use a numeric entity
    $contents =~ s[< $_ (?: \b [^>]*? [^/])? \K >][ />]gxism for qw[ link br img];
    $contents =~ s[&middot;][&#183;]gxism;

    eval { $t->parse($contents) }
      or die "$@ \n contents = $contents\n";

    my $root = $t->root;

    # the unused days are the links whose href is still empty
    my @links = $root->get_xpath('.//a[@href=""]');
    foreach my $link (@links) {
        $link->set_text( $link->text() + 1 );
        $link->delete() if $link->text() > 25;
    }

    # output the document
    $contents = $t->sprint($root);
    $contents =~ s[\xb7][&middot;]gxism;    # restore html entities
    print $contents;

1. This does mean that <p> tags need close tags </p>, and tags must be properly nested, not straddled like <b><i> blah </b></i>.
