Semantic Perl

RDF::Trine and RDF::Query - 2013-12-01

SEMANTIC PERL

Oh, what was that Christmas song I was trying to remember? It was a duet... I know that. Wikipedia's "Christmas songs" list is enormous; I don't want to click through each of those to find it.

Anyway, I'll forget about that for now; time for a Perl Advent...

The Triple

The semantic web is built around the idea of the "triple". The idea of the triple is so simple that a three year old can grasp it. (Trust me.)

To meaningfully describe anything we need at least three pieces of information:

We need to know what thing is being described.
We need to specify an attribute of the thing.
We need to provide a value for that attribute.

For example, assuming we have two Person objects, in Perl we might write something like:

### Three triples...
$person1->name("Alice");     # $person1's name is "Alice".
$person2->name("Bob");       # $person2's name is "Bob".
$person1->spouse($person2);  # $person1's spouse is $person2.

If you've ever told anyone your name, then you've used a triple.

More complex descriptions might be temporally or geographically qualified (Alice's spouse hasn't been Bob since the dawn of time), but information theory tells us that complex facts can be broken down into multiple triples, generally by taking a relationship and turning it into a "thing" in its own right:

$marriage->participant1($person1);
$marriage->participant2($person2);
$marriage->start($wedding_date);
$marriage->legally_recognised_in(@countries);

So that's the triple. Three components. Any fewer would be useless; any more are just sugar.

A little terminology: in semantic web circles these three components are generally referred to as the subject, predicate and object respectively.

The Resource

The semantic web takes the concept of the triple and marries it with the concept of "resources".

What is a resource? A resource is anything that can be identified with a uniform resource identifier (URI). (If you're unfamiliar with URIs, think URL, or "web address". They're almost the same thing. Almost.) This is a somewhat circular definition.

https://metacpan.org/ is a URI. It is an identifier for a resource. That particular identifier happens to identify a resource which is a document. There are plenty of other URIs that identify documents, images, videos, mailboxes, FTP sites, and so forth on the Internet.

But nothing in the URI specifications places any limits on what kind of things may be identified by a URI. You could generate a URI to identify a car, or a planet, or a person.

While you can "dereference" https://metacpan.org/ to get (a copy of) the actual document identified by that URI, you cannot expect to dereference a URI representing a car to retrieve an actual car. This is the difference between URLs and URIs. URLs are locators; URIs are mere identifiers. URLs are a subset of URIs.

URIs give us a global namespace for identifying things. So in the semantic web, we have triples where each component is a resource (i.e. something that can be identified by a URI).

It is fine to have multiple URIs identifying the same thing (think multiple references to the same object in Perl), but multiple things identified by the same URI leads to ambiguity.

This model of triples of resources is referred to as the Resource Description Framework (snappy name, right?) or RDF.

My First Triple

Let's assume that I've chosen the URI http://tobyinkster.co.uk/#i to identify myself. The inclusion of #i means that it's different from the URL that identifies my homepage. (Remember, multiple things identified by the same URI leads to ambiguity; so the URI I use to identify myself must be different to the URI for my homepage.) Now I can state my name in RDF like this:

 <http://tobyinkster.co.uk/#i>
    <http://xmlns.com/foaf/0.1/name>
       "Toby" .

The second component there is a URI defined by the FOAF specification to identify the concept of having a name,

Note that the last component there is not a URI. A resource is anything that can be identified by a URI, but that doesn't always mean that it must be identified by a URI. Sometimes it's useful for a string to just be a string.

Introducing RDF::Trine and RDF::Query

While there are several other semantic web tool-kits on CPAN, RDF::Trine has emerged as the defacto standard.

RDF::Trine provides an object-oriented interface for dealing with resources and triples; persistent stores for triples, along with some basic methods for querying them; parsers and serialisers for various triple-based file formats (RDF/XML, Turtle, etc).

RDF::Trine is often used in conjunction with RDF::Query which provides an implementation of SPARQL, an SQL-like query and update language for triple stores. The author of RDF::Trine and RDF::Query is involved in the W3C group standardising SPARQL 1.1, and as a result his Perl libraries are very high quality and standards-compliant implementations.

Here's an example of using the previous triple in RDF::Trine:

use 5.010;
use strict;
use RDF::Trine qw( statement iri literal );
use RDF::Query;

# Create the triple.
#
my $triple = statement(
   iri("http://tobyinkster.co.uk/#i"),
   iri("http://xmlns.com/foaf/0.1/name"),
   literal("Toby"),
);

# Let's take a look at that.
#
say $triple->as_string;

# Create an in-memory triple database for querying...
#
my $model = RDF::Trine::Model->new;

# ... and add our triple.
#
$model->add_statement($triple);

# I told you it was SQL-like...
#
my $query = RDF::Query->new(<<'SPARQL');
SELECT $name
WHERE {
   <http://tobyinkster.co.uk/#i>
      <http://xmlns.com/foaf/0.1/name>
         $name .
}
SPARQL

# Run the query against the database
#
my $results = $query->execute($model);

# This is an iterator, a bit like DBI results
#
while (my $row = $results->next) {
   say "GOT ", $row->{name}->value;
}

What is it Good For?

RDF is a good way of representing highly heterogeneous data.

Imagine that you're storing the following data in an SQL database:

Alice is called Alice
Bob is called Bob
Alice is married to Bob

Easy peasy, right? You can probably already envisage the tables you'd use. Now let's store some more data:

Alice takes a drug called Foo-lin.
Foo-lin is manufactured by ACME Corp.
Foo-lin is a variety of insulin.
Bob likes "Fight Club".
"Fight Club" is a film.
"Fight Club" stars Brad Pitt.
Brad Pitt is called Brad.

Ten facts, and already the tables are spiralling out of control. Using RDF all that is just ten triples.

RDF allows you to model messy data; allows you to merge data from different data sources which have different models of the world. As long as two RDF data sources agree on URIs to identify things with, merging two databases can be as easy as concatenating two lists of triples.

All You Can Eat Data Buffet

There are over 30 billion RDF triples out there on the Internet. Because RDF uses URIs to identify things, and different data sources can use each others' URIs, this forms a loosely interlinked "web of data". (Just as web pages link to each other to form a "web of documents".)

DBpedia, a project to lift Wikipedia's information into RDF, might indicate the location for a particular World War II battle using a URI from the GeoNames project. Status.net, an open source microblogging platform, uses the same predicate URIs to describe people as LiveJournal.

I won't say that it's easy to use the entire web of data as a massive distributed database, but it's beginning to look possible.

For our conclusion, let's focus on just one of those linked open data projects: DBpedia.

The SPARQL Protocol

(Yes, that sounds like a new Dan Brown book.)

As well as being an SQL-like query language, SPARQL also includes a RESTful HTTP protocol for querying remote RDF data stores.

DBpedia's SPARQL endpoint is http://dbpedia.org/sparql. Let's query it using RDF::Query::Client.

use 5.010;
use strict;
use RDF::Query::Client;
use constant DBPEDIA => 'http://dbpedia.org/sparql';

# Notice the "PREFIX" keyword in SPARQL can be used to set up
# abbreviations for long URIs.
#
my $query = RDF::Query::Client->new(<<'SPARQL');
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dc:       <http://purl.org/dc/terms/>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
SELECT $title
WHERE {
   $resource dc:subject category:Christmas_songs .
   $resource dc:subject category:Vocal_duets .
   $resource rdfs:label $title .
   FILTER ( langMatches(lang($title), "en") )
}
SPARQL

my $results = $query->execute( DBPEDIA );
while (my $row = $results->next) {
   say "GOT ", $row->{title}->value;
}

This gives me a list of five songs; much easier to narrow down the song I was looking for. Oh yes, that was it...

Perl Advent Calendar 2013