YA Perl Advent Calendar 2005-12-23

Today's column is by William 'N1VUX' Ricker, of Boston.PM, who has requested today's door and prepared this review as a holiday gift for Uri.

So you've got this utility that counts connections by source, maybe it's scanning netstat -a in realtime to colorize a network diagram, or the Apache logs weekly (or hourly). Usually you'd sort by Domain Name or TLD, or by count or by port, but this time the boss wants it sorted by source IP subnet, since that's how the datacenter network is organized. So the desired report will look something like

0.0.0.0     	 3  reallylong.sillyname.foobar.do 
10.32.125.101	62 	laxweb1.firewall.foobar.dom 
10.32.125.119	98	laxweb2.firewall.foobar.dom 
10.37.136.119	45	bosweb1.firewall.foobar.dom
10.37.136.120	73	bosweb2.firewall.foobar.dom
10.21.53.103	 1	monitor.foobar.dom 
10.21.192.232	 2	ops.foobar.dom 
10.45.16.89 	 1	lanserver02.bos.foobar.dom 
10.45.183.222	 2	desktop9483.bosfoobar.dom 
10.46.120.197	 3	10.46.120.197 
10.47.156.24	 1	lanserver031.lax.foobar.dom 
10.47.156.26	 1	desktop051.lax.foobar.dom 
10.47.156.34	 2	desktop003.lax.foobar.dom 
10.47.156.40	 2	desktop031.lax.foobar.dom 
127.0.0.1   	48	localhost.localdomain 
172.24.26.77	 1	qa07.foobar.dom 
172.24.101.207	 1	qa92.foobar.dom 
172.24.101.208	 6	qa93.foobar.dom 
172.24.101.251	 1	qa05.foobar.dom 

but lots longer, so you need this to run fast.

Sorting with a simple comparison block would require way too much slicing and dicing on each of the O(N*logN) comparisons, so you have to do something efficient -- keep the O(N*logN) steps fast and the slow steps only in O(N). The classical Schwartzian Transform shows the structure of a compound key nicely,

   1 my @sorted =
   2         map { $_->[0] }
   3                 sort {
   4                         $a->[1] <=> $b->[1]
   5                                 ||
   6                         $a->[2] <=> $b->[2]
   7                                 ||
   8                         $a->[3] <=> $b->[3]
   9                                 ||
  10                         $a->[4] <=> $b->[4]
  11                         }
  12                          map { [ $_, split /[. ]/ ] }
  13                                 @unsorted;

but it will call back to the coded-in-Perl decision block from the built-in-C Perl runtime for each comparison O(N*logN times), so it still has a speed penalty:

The GRT

The Guttman-Rosler Transform is a special case of the Schwartzian transform that makes strings, not array-refs; the strings contain null-delimited keys in sequence followed by (typically) the original record # and the contents of the record or index into a hash or array. This allows it to be sorted natively, so the O(N*logN) step is the built-in native C implementation of sort, which acts like sort {$a cmp $b} only much faster.

In the great Internet sorting debate between the Schwartzian Transform and the Guttman-Rosler Transform, Merlyn admitted

«Yes. The ST can sort any multilevel complex sort, and is "programmer efficient". When the GRT can be used, and the programmer is willing to invest time to compute the "transform to and from a single string" functions, the GRT may be faster.»

Uri Guttman will regularly reply that in a multilevel sort, GRT will always be faster (barring really odd cases and data sets so small as to be ignored), since sort @array is faster than sort {anyblock} @array, no matter how simple the block: sort with no written-in-Perl callback is Perl core compiled C speed. It all depends on the balance between programmer time and computer time. (If efficiency was the primary objective, and we wanted to still use Perl, we should consider PDL or something similar.)

The Module

But Uri's a fair but balanced guy, so his Sort::Maker module gives you both ST and GRT, and also Joseph Hall's "orcish" maneuver from and good old plain sorting, and builds and compiles the sort idiom for you. Uri's, he gives you all three options, and tools to generate and benchmark them.

Alas, there's a problem with Sort::Maker's POD: it doesn't match the implementation. The POD claims we can use an array ref or a hash for an index's sub arguments, and that we can omit the =>1 for boolean arguments. But it doesn't work. The bold [] on lines 28 and 31 have to be {} and that requires =>1 on booleans too. The following sample is straight from the POD, with the addition of off-screen scaffolding 1..25 and line 35 error output:

  26         my $sorter = make_sorter(
  27                 qw( plain ),
  28                 number => [   
  29                         code => '/(\d+)/',
  30                         'descending',
  31                 ],
  32         ) ;
  33 
  34 
  35  	carp "no sort $@" unless $sorter;
  36 
  37 
  38 	my @sorted = $sorter->( @unsorted ) ;
$ perl uri.pl
no sort make_sorter: Unknown attribute '/(\d+)/' at uri.pl line 35
Use of uninitialized value in subroutine entry at uri.pl line 38.
Can't use string ("") as a subroutine ref while "strict refs" in use at uri.pl line 38.

Always read the warning label.

References

Features

Amongst its diverse weapons of inquisition, Sort::Maker can do

  1. plain no-caching sort,
  2. Orcish maneuver sort;
  3. Schwartzian Transform Sort;
  4. GRT sort, either whole or of indices -- order preserving!
  5. Ability to use module self-tests to benchmark itself;
  6. Ability to sort case (in)sensitive, ascending or descending;
  7. Can return the subroutine reference or install it into namespace of your choice.
  8. Ability to sort unsigned, signed numbers, floats, and fixed and varying strings in the GRT style.
  9. Ability to use PBP recommended anonymous subs (yes!) or eval strings (just say no) for key-extraction callbacks.

Enjoy!

mod23.pl


   1 #! perl -l -w
   2 
   3 use strict;
   4 use warnings;
   5 use Carp;
   6 use Sort::Maker;
   7 
   8  open my $NETSTAT, "netstat -a|"
   9   or die "Netstat fails $@ ";
  10 
  11 my %Connections;
  12 
  13 while (<$NETSTAT>) {
  14   my @F=split /\s+/, $_;
  15   next unless $F[-1] eq 'ESTABLISHED';
  16   my ($local, $remote)=@F[3..4]; # Slice!
  17 
  18   next if $local =~ /[.:][*]/ or $remote =~ /[*]/;
  19   my @L=split /[.:]/, $local;
  20   $remote=~ s/[.:][^.:]+$//;
  21 
  22   $Connections{$remote}->{count}++;
  23 
  24 
  25 
  26 }
  27 
  28 package utility;
  29    { ## Scope for private  var
  30    use Socket; # for AF_INET, inet_aton
  31    my $Zeroes=inet_aton("0.0.0.0");
  32 
  33 sub key_of {
  34   my $addr=shift;
  35   return inet_aton("0.0.0.0") unless defined $addr and $addr;
  36   my $is_numeric =  $addr =~ /^ ([.\d]+) $/;
  37 
  38   if ($is_numeric)  {
  39     # my $name = (gethostbyaddr($addr,AF_INET))[0];
  40     return inet_ntoa($addr);
  41   }
  42   else {
  43     my $num=(gethostbyname($addr))[4];
  44     return $Zeroes unless $num;
  45     return $num;
  46   }
  47 
  48 }
  49 
  50 } # end  scope block
  51 
  52 package main;
  53 use Socket; # for inet_ntoa;
  54 
  55   my $sort_func = 
  56   make_sorter(
  57                            plain=>1 ,
  58                             # name=> 'main::sort_func',
  59                             string => {
  60                                       ascending=>1,
  61                                          unsigned=>1,
  62                                       code => sub { utility::key_of $_ ;
  63 } ,
  64                                      },
  65                           )
  66 
  67   or croak "no sort: $@";
  68 
  69 
  70   my $name;
  71   for $name ($sort_func->(keys %Connections)) {
  72     my $addr=inet_ntoa(utility::key_of ($name));
  73     print "$Connections{$name}->{count}\t$name\t$addr";
  74 
  75   }