The 2003 Perl Advent Calendar

On the 20th day of Advent my True Language brought to me...
Regexp::Common

Are you fed up with writing the same regexes over and over again, even though someone's bound to have written (and debugged) them a hundred times already?

Someone should put a collection of them in a module up on the CPAN. Oh, wait, someone did.

Suppose you want to make sure a scalar has something that looks like a number in it. It's a fairly simple regex to write, right?

  $scalar =~ /^\d+$/;

That's

  $scalar =~ /^    # start of line 
              \d   # digit
              +    # one or more times
              $    # till the end of line
             /x;   # allow me to split the line up like this

Of course, that falls over as soon as someone puts in a floating point number.

   3.14159265   # the dot doesn't match \d

So we need to expand that to cover situations where there might optionally be extra bits on the end:

  $scalar =~ /^    # start of line 
              \d   # digit
              +    # one or more times
              (    # group for floating point part
               \.  # literal dot
               \d  # digit
               +   # one or more times
              )    # end group for floating point type
              ?    # group may or may not exist (is optional)
              $    # till the end of line
             /x;   # allow me to split the line up like this

Which works fine until someone does this:

   -2.71828183

So we have to modify it to have an optional plus or minus sign at the start:

  $scalar =~ /^    # start of line 
              [+-] # plus or minus
              ?    # which is optional
              \d   # digit
              +    # one or more times
              (    # group for floating point part
               \.  # literal dot
               \d  # digit
               +   # one or more times
              )    # end group for floating point type
              ?    # group may or may not exist (is optional)
              $    # till the end of line
             /x;   # allow me to split the line up like this

And guess what...then someone writes this:

   6.626068e-34

And we get really annoyed. At this point we're writing so much code that I have the distinct urge to write some tests. But more than that, I get to thinking: wouldn't it be nice if someone had written this already? It's a fairly common occurrence - it's not like we're the first people ever to want to match a number.
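For the record, here is roughly where the hand-rolled approach ends up once the exponent is bolted on, along with the tests it has started to need. This is only a sketch: the variable name $number and the choice of test cases are mine, and the pattern still ignores things like a leading dot (".5") or hexadecimal numbers.

```perl
use strict;
use warnings;
use Test::More tests => 7;

# Hand-rolled number pattern, extended with an optional exponent.
my $number = qr/^
    [+-]?                   # optional sign
    \d+                     # integer part
    (?: \. \d+ )?           # optional fractional part
    (?: [eE] [+-]? \d+ )?   # optional exponent, e.g. e-34
    $                       # till the end of line
/x;

like('42',           $number, 'plain integer');
like('3.14159265',   $number, 'floating point');
like('-2.71828183',  $number, 'negative floating point');
like('6.626068e-34', $number, 'scientific notation');

unlike('abc', $number, 'letters are not a number');
unlike('1.',  $number, 'trailing dot without digits');
unlike('--5', $number, 'doubled sign');
```

Every new input format means another branch in the pattern and another handful of tests, which is exactly the maintenance burden a shared, already-debugged collection takes away.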

And then we look in Regexp::Common. Lo and behold! There's one there to do it! Remind me again why I'm writing my own code?

Loading Regexp::Common exports a hash, %RE, into our namespace. This hash contains many compiled regexes which we can use in our regular expressions. For example:

    $scalar =~ /$RE{num}{real}/;
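One thing worth watching: used bare like that, the pattern is not anchored, so it is satisfied by a number anywhere in the string. To check that the whole scalar is a number, as the hand-written regexes did, wrap it in ^ and $ yourself. A minimal sketch, assuming the Regexp::Common module from CPAN is installed; looks_like_real is a hypothetical helper name of my own, not part of the module:

```perl
use strict;
use warnings;
use Regexp::Common;

# Hypothetical helper: true only when the entire string is a real number.
sub looks_like_real {
    my ($value) = @_;
    return $value =~ /^$RE{num}{real}$/ ? 1 : 0;
}

for my $candidate ('3.14159265', '-2.71828183', '6.626068e-34', 'route 66') {
    printf "%-14s %s\n", $candidate,
        looks_like_real($candidate) ? 'number' : 'not a number';
}
```

Without the anchors 'route 66' would count as a match too, because the 66 on its own is a perfectly good number.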

Regexp::Common also provides a subroutine interface to get at the regexes, if you prefer to use it like that:

    use Regexp::Common 'RE_ALL';
    my $regex = RE_num_real();
    if ($scalar =~ $regex)
     { print "It matched!" }

In either case the regexes are blessed, meaning you can call methods on them and treat them just like they're objects.

    my $num_regex = $RE{num}{real};
    if ($num_regex->matches($scalar))
      { print "It matched!" }

One thing you can say about Regexp::Common: it provides a lot of syntactic sugar.

What Regexp::Common can match

I'm not going to provide examples of everything that Regexp::Common can match - that would take forever and a day. I'm just going to touch on some of the things that I've found most useful.

Aside from number matching, the one regular expression set I've found the most useful is the profanity matching. This is impossible to do properly without really annoying the residents of Middlesex and Scunthorpe by blocking out the inappropriate words in their place names, so you can only provide basic checking that's 'good enough'. Regexp::Common provides a collection that's 'good enough' from the outset, and means I no longer have to worry about constructing such things.

There are one or two regexes in the collection that I could easily write myself, but they're really tiresome to do each time and - as always when you write code rather than reusing existing known-good code - you run the risk of making a mistake or typo. The ones that spring particularly to mind are the code for removing whitespace from the start or end of strings, and the code for removing comments from text.
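As a rough illustration of those two, here is a sketch using $RE{ws}{crop} for trimming and $RE{comment}{C} for stripping C-style comments. It assumes Regexp::Common is installed; the sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;
use Regexp::Common;

# Trim whitespace from both ends; the /g matters, because the pattern
# matches the leading run and the trailing run as two separate matches.
my $padded = "   hello world  \n";
(my $trimmed = $padded) =~ s/$RE{ws}{crop}//g;

# Strip C-style comments out of a snippet of source text.
my $source = 'int x = 1; /* counter */ x++;';
(my $bare = $source) =~ s/$RE{comment}{C}//g;

print "[$trimmed]\n";
print "$bare\n";
```

Other comment styles are looked up the same way by language name, for example $RE{comment}{Perl}.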

Straying onto more advanced territory, there's even code for matching balanced brackets, something that in a strict mathematical sense a regular expression shouldn't be able to do (but Perl can, because its regular expressions aren't that regular).
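A small sketch of the balanced matcher, assuming Regexp::Common is installed. The -parens flag says which bracket pairs to balance; the sample text and variable names are my own.

```perl
use strict;
use warnings;
use Regexp::Common;

# Pattern for a run of text with properly nested round brackets.
my $parens = $RE{balanced}{-parens => '()'};

my $text = 'sum(a, max(b, c)) + 1';

# Capture the first balanced group, nested brackets and all.
my ($group) = $text =~ /($parens)/;

print "$group\n" if defined $group;
```

The match runs from the first opening bracket to the closing bracket that pairs with it, so the nested max(b, c) is swallowed whole rather than cutting the match short at the first ')'.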

Then there's some clever stuff in there to match lists, where you can have things like "rod, jane, and freddy" and get the results back carefully dumping things like "and".

I could go on all day like this... have a look around in the list of modules yourself.

  • Regexp::Common::number
  • Regexp::Common::profanity
  • Regexp::Common::whitespace
  • Regexp::Common::comment
  • Regexp::Common::balanced
  • Regexp::Common::list