2016 twenty-four merry days of Perl

Benchmarking with Bencher

Bencher - 2016-12-03

Santa had a problem. Due to increased use of machine transcription, his list of children's names contained far more misspellings than previously, and really had to be checked twice. Santa ordered one of the elves to write a Perl script for this task, but there was a catch - with more than a billion names on the list, the script needed to be fast.

The elf started off by evaluating several modules on CPAN that calculate Levenshtein edit distance (among others: Text::Levenshtein, Text::Levenshtein::XS, Text::Levenshtein::Flexible, Text::LevenshteinXS) and trying to pick one to use for his script, preferably the fastest one.

"I'll simply write a benchmark script to find out which one is the fastest," he thought to himself. Normally, that script would have used the built-in Benchmark module, like this:

use Benchmark 'cmpthese';

use Text::Levenshtein ();
use Text::Levenshtein::XS ();
use Text::Levenshtein::Flexible ();
use Text::LevenshteinXS ();

cmpthese(
    100_000,
    {
        'Text::Levenshtein' => sub { Text::Levenshtein::fastdistance("foo", "bar") },
        'Text::Levenshtein::XS' => sub { Text::Levenshtein::XS::distance("foo", "bar") },
        'Text::Levenshtein::Flexible' => sub { Text::Levenshtein::Flexible::levenshtein("foo", "bar") },
        'Text::LevenshteinXS' => sub { Text::LevenshteinXS::distance("foo", "bar") },
    }
);

but I tricked him into, er, suggested that he try the Bencher framework for a change. So here's what he wrote instead:

# lib/Bencher/Scenario/Levenshtein.pm
package Bencher::Scenario::Levenshtein;

use strict;
use warnings;

our $scenario = {
    summary => 'Benchmark modules that calculate Levenshtein edit distance',
    participants => [
        {fcall_template => "Text::Levenshtein::fastdistance(<word1>, <word2>)"},
        {fcall_template => "Text::Levenshtein::XS::distance(<word1>, <word2>)"},
        {fcall_template => "Text::Levenshtein::Flexible::levenshtein(<word1>, <word2>)"},
        {fcall_template => "Text::LevenshteinXS::distance(<word1>, <word2>)"},
    ],
    datasets => [
        { name => "foo", args => {word1=>"foo", word2=>"bar"}, result => 3 },
    ],
};

1;

What's different between the two? First of all, the script has been turned into a module containing a data structure called $scenario. The code snippets, called participants, have become code templates with variables written inside angle brackets, like <name>. The values for those variables are supplied via the datasets key.
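To get a feel for what these templates mean, here is a rough sketch of how a participant's fcall_template and a dataset's args combine into the code that gets benchmarked. This is only an illustration, not Bencher's real template handling:

use strict;
use warnings;

# One participant's code template and one dataset's arguments
my $template = 'Text::Levenshtein::Flexible::levenshtein(<word1>, <word2>)';
my %args     = (word1 => "foo", word2 => "bar");

# Substitute each <name> placeholder with the (quoted) value from the dataset
(my $code = $template) =~ s/<(\w+)>/'"' . $args{$1} . '"'/ge;

print "$code\n";   # Text::Levenshtein::Flexible::levenshtein("foo", "bar")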

How do we run this scenario module? Install the bencher-tiny script from the Bencher-Tiny distribution:

   % cpanm -n Bencher::Tiny

then run:

   % PERL5OPT=-Ilib bencher-tiny -c 100000 Levenshtein

The output will be essentially the same as the output of the first script we saw, because bencher-tiny also uses Benchmark to benchmark the code:

               (warning: too few iterations for a reliable count)
               (warning: too few iterations for a reliable count)
               (warning: too few iterations for a reliable count)
                                                 Rate Text::Levenshtein::fastdistance Text::Levenshtein::XS::distance Text::LevenshteinXS::distance Text::Levenshtein::Flexible::levenshtein
   Text::Levenshtein::fastdistance            52083/s                              --                            -92%                          -99%                                     -99%
   Text::Levenshtein::XS::distance           666667/s                           1180%                              --                          -87%                                     -87%
   Text::LevenshteinXS::distance            5000000/s                           9500%                            650%                            --                                      -0%
   Text::Levenshtein::Flexible::levenshtein 5000000/s                           9500%                            650%                            0%                                       --

However, turning our benchmark script into a scenario module means we can do a lot more things with it. First of all, let's use the full-featured CLI bencher (from the Bencher distribution) instead of bencher-tiny. Install it from CPAN (this might take a while, as it has quite a lot of dependencies):

   % cpanm -n Bencher

then run:

   % bencher -Ilib -m Levenshtein
   +------------------------------------------+-----------+-----------+------------+---------+---------+
   | participant                              | rate (/s) | time (us) | vs_slowest |  errors | samples |
   +------------------------------------------+-----------+-----------+------------+---------+---------+
   | Text::Levenshtein::fastdistance          |     51000 |    20     |        1   | 3.3e-08 |      20 |
   | Text::Levenshtein::XS::distance          |    757000 |     1.32  |       14.8 | 1.8e-10 |      20 |
   | Text::LevenshteinXS::distance            |   8500000 |     0.12  |      170   | 2.4e-10 |      20 |
   | Text::Levenshtein::Flexible::levenshtein |   8850000 |     0.113 |      173   | 1.1e-10 |      20 |
   +------------------------------------------+-----------+-----------+------------+---------+---------+

You'll notice several things are different. Instead of Benchmark, the bencher CLI by default uses Dumbbench to benchmark the code. It then presents the results as a table.

You'll also notice that the script finishes much more quickly, and the results are more accurate for the faster participants. Recall that Benchmark complained above that we didn't use enough iterations for a reliable count. To avoid that warning we would need to set the count to something like 3_000_000, but imagine how long the benchmark would take to run in that case (around a minute, because Text::Levenshtein can only perform roughly 50k calculations per second). By contrast, if you look at the samples field in the results, you'll see that Dumbbench only needs about 20 runs for each participant. You don't even need to set the count parameter at all, because Dumbbench figures out the minimum sufficient number of runs by itself.

Aside from this difference in output, there are quite a number of other things we can do.

Adding more datasets

Remember how we split the code and data when we constructed the scenario? The benefit of doing this is that we can easily add more data. Let's say we want to measure performance for some longer word. We'll just add this to our datasets:

{ name => "program", args => {word1=>"program", word2=>"porgram"}, result => 2 },
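For reference, the datasets key of our scenario now looks like this:

datasets => [
    { name => "foo",     args => {word1=>"foo",     word2=>"bar"},     result => 3 },
    { name => "program", args => {word1=>"program", word2=>"porgram"}, result => 2 },
],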

then run:

   % bencher -Ilib -m Levenshtein
   +------------------------------------------+---------+-----------+-----------+------------+---------+---------+
   | participant                              | dataset | rate (/s) | time (us) | vs_slowest |  errors | samples |
   +------------------------------------------+---------+-----------+-----------+------------+---------+---------+
   | Text::Levenshtein::fastdistance          | program |     11000 |    89     |        1   | 1.1e-07 |      20 |
   | Text::Levenshtein::fastdistance          | foo     |     52000 |    19     |        4.7 | 3.3e-08 |      20 |
   | Text::Levenshtein::XS::distance          | program |    480000 |     2.1   |       43   | 3.3e-09 |      20 |
   | Text::Levenshtein::XS::distance          | foo     |    738000 |     1.36  |       65.7 | 4.2e-10 |      20 |
   | Text::LevenshteinXS::distance            | program |   3180000 |     0.314 |      284   | 9.7e-11 |      28 |
   | Text::Levenshtein::Flexible::levenshtein | program |   4170000 |     0.24  |      371   | 1.7e-10 |      25 |
   | Text::LevenshteinXS::distance            | foo     |   7300000 |     0.137 |      650   | 4.5e-11 |      20 |
   | Text::Levenshtein::Flexible::levenshtein | foo     |   7660000 |     0.131 |      682   | 4.6e-11 |      20 |
   +------------------------------------------+---------+-----------+-----------+------------+---------+---------+

There's now a dataset result field, since we are running with multiple datasets.

Filtering datasets, participants, modules, etc

To use only one specific dataset:

   % bencher -Ilib -m Levenshtein --include-dataset program

There are similar options for filtering participants: --include-participant, --exclude-participant, --include-participant-pattern, and so on. We can also include or exclude certain modules. For example, let's exclude all the pure-Perl modules, because they have no hope of competing with the XS ones:

   % bencher -Ilib -m Levenshtein --include-dataset program --nopp
   +------------------------------------------+-----------+-----------+------------+---------+---------+
   | participant                              | rate (/s) | time (us) | vs_slowest |  errors | samples |
   +------------------------------------------+-----------+-----------+------------+---------+---------+
   | Text::Levenshtein::XS::distance          |    410000 |     2.5   |        1   | 3.3e-09 |      20 |
   | Text::LevenshteinXS::distance            |   2800000 |     0.357 |        6.9 | 3.3e-10 |      20 |
   | Text::Levenshtein::Flexible::levenshtein |   3600000 |     0.28  |        8.9 | 4.3e-10 |      24 |
   +------------------------------------------+-----------+-----------+------------+---------+---------+
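Similarly, to benchmark just a single participant we could name it explicitly, something like this (assuming the participant name as shown in the result tables is accepted as the option's value):

   % bencher -Ilib -m Levenshtein --include-participant Text::Levenshtein::Flexible::levenshtein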

There are other kinds of filtering available, for example by tags, sequence, and so on.

Instead of running the benchmark, you can also verify or inspect the participants (--list-participants) and datasets (--list-datasets), or just run the code once and display the result (--show-items-results):

   % bencher -Ilib -m Levenshtein --nopp --show-items-results
   #0 (dataset=foo participant=Text::Levenshtein::XS::distance):
   3

   #1 (dataset=program participant=Text::Levenshtein::XS::distance):
   2

   #2 (dataset=foo participant=Text::Levenshtein::Flexible::levenshtein):
   3

   #3 (dataset=program participant=Text::Levenshtein::Flexible::levenshtein):
   2

   #4 (dataset=foo participant=Text::LevenshteinXS::distance):
   3

   #5 (dataset=program participant=Text::LevenshteinXS::distance):
   2
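The listing options mentioned above work the same way; for example (output not shown):

   % bencher -Ilib -m Levenshtein --list-participants
   % bencher -Ilib -m Levenshtein --list-datasets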

Checking the results first

Notice that in each dataset, we added this:

result => 2

or:

result => 3

This parameter is optional, but if we specify a value here then bencher will first compare it to the results of running the code, to make sure that the code we are benchmarking returns the correct result. Fast but wrong code is useless, after all.

More features

Bencher can do plenty of other things, for example:

  • Benchmark module startup overhead

  • Benchmark against multiple perls

  • Benchmark against multiple module versions

  • Show data structure size and memory usage

  • Present the results as a chart or graph

  • Include CPU/other system information

  • Return raw structured data (--json) for easy manipulation or transport to servers
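
As an example of that last point, the raw result can be requested as JSON and fed to other tools (output not shown):

   % bencher -Ilib -m Levenshtein --json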

I've also written plugins for Dist::Zilla and other Bencher-related CLI tools. For one-off benchmarking this might not mean much, but if you benchmark regularly during development (for example to watch out for performance regressions), Bencher can be a useful addition to your toolbox.

This article contributed by: perlancar <perlancar@cpan.org>