Triple Store

Notice

This essay was originally written about a piece of software that I was commissioned to design for Radar Networks. It was delivered to the W3C in late November 2003 at a conference in Amsterdam. The project was ultimately cancelled due to personality conflicts between staff members; as a result, I re-built the software in C for my personal use.

Definition

A triple store is designed to store and retrieve identities that are constructed from triplex collections of strings (sequences of characters). These triplex collections represent a subject-predicate-object relationship that more or less corresponds to the definition put forth by the RDF standard.
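As a rough illustration of the data model (the C struct and field names below are illustrative shorthand for this essay, not the actual storage format), a single triple is just three string terms:

    /* Illustrative only: a triple as three string terms.  The struct and
       field names are shorthand for this essay, not the storage format. */
    struct triple {
        const char *subject;    /* e.g. "http://example.org/alice" */
        const char *predicate;  /* e.g. "http://xmlns.com/foaf/0.1/knows" */
        const char *object;     /* e.g. "http://example.org/bob", or a literal */
    };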

The problem space of storing this sort of data has been explored by the graph database, object database, and Prolog communities and, more recently, by the semantic web community. A more thorough backgrounder is provided by the work on Datalog, Jena, Dave Beckett’s Redland, the AT&T Research Communities of Interest project, Ora Lassila’s Wilbur, and the activities of the W3C Semantic Web project, among others.

Set Theory

It turns out that this logical basis for storing and querying relationships is nearly ideally suited to the human-friendly indexing of textual data (this web site, for instance). The basic properties of such a system are set theoretical, following the rules prescribed here:

Let T = { t1, t2, t3, ... } (the set of all triples)

Let t = ƒ( s, p, o ) such that:

s is the subject, p the predicate, o the object, and s, p, o ∈ S (the set of all strings).

Each t is uniquely identified by its ordered ( s, p, o ) terms, so T ⊆ S × S × S.
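A basic search can then be read as a partially specified triple: any of the three terms may be left unbound, and the result is the subset of T whose bound terms match. The sketch below is illustrative only, assuming an in-memory array and a NULL-as-wildcard convention; the store proper performs the same selection against its on-disk indices rather than a linear scan:

    /* Illustrative only: selection over an in-memory array of triples,
       with NULL acting as a wildcard term. */
    #include <stdio.h>
    #include <string.h>

    struct triple { const char *s, *p, *o; };

    static int term_matches(const char *pattern, const char *value)
    {
        return pattern == NULL || strcmp(pattern, value) == 0;
    }

    static void query(const struct triple *T, size_t n,
                      const char *s, const char *p, const char *o)
    {
        for (size_t i = 0; i < n; i++)
            if (term_matches(s, T[i].s) &&
                term_matches(p, T[i].p) &&
                term_matches(o, T[i].o))
                printf("%s %s %s\n", T[i].s, T[i].p, T[i].o);
    }

    int main(void)
    {
        struct triple T[] = {
            { "alice", "knows", "bob"   },
            { "bob",   "knows", "carol" },
            { "alice", "likes", "jazz"  },
        };
        query(T, 3, NULL, "knows", NULL);   /* every "knows" triple */
        return 0;
    }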

Performance

The performance of most similar systems is bounded by the speed of an external relational database (with the attendant overhead of IPC and double buffering betwixt the two processes) or a linked-in library (usually Berkeley DB) that is tuned for good general-case performance, but which is not tuned for this workload. My approach has been the creation of a from-scratch database with an on-disk layout that’s designed specifically for this purpose.

The hard part, as is always the case in computer science, is less the creation of a conformant system than the creation of a conformant system that is also performant. I have thus far achieved this order of complexity for basic searches:

Let N = the number of triples.

Let M = the number of triples matched.

A basic search then completes in O( log N + M ) time: a logarithmic descent through the index to the first match, followed by a walk across the M matching triples.

However, this sort of classical analysis isn’t the whole story. The underlying structures include B+Trees that attempt to pack as much data as possible onto each B-Page in order to minimize the number of disk seeks required to collect the requested data and maximize the efficiency of the page cache maintained by the database.
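As a sketch of what packing a B-Page can look like (the sizes, layout, and field names here are illustrative, not the actual on-disk format), a fixed-size page carries a small header and as many fixed-width entries as will fit, so the fan-out per disk seek stays high:

    /* Illustrative only: a fixed-size B-Page with a small header and as
       many fixed-width entries as fit; not the actual on-disk layout. */
    #define PAGE_SIZE 8192

    struct bpage_header {
        unsigned int  nkeys;      /* entries currently stored in this page */
        unsigned int  is_leaf;    /* leaves hold data; interior pages hold children */
        unsigned long right_sib;  /* right-sibling page number, for sequential scans */
    };

    struct bpage_entry {
        unsigned long s, p, o;    /* string-table indices, not raw strings */
        unsigned long child;      /* child page number (interior pages only) */
    };

    enum {
        ENTRIES_PER_PAGE = (PAGE_SIZE - sizeof(struct bpage_header))
                         / sizeof(struct bpage_entry)
    };

On a typical 64-bit layout that works out to roughly 250 entries per 8 KB page, which is what keeps the tree shallow and the number of seeks per lookup low.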

Graph nodes store long integers that represent the indices of string values in an on-disk table. This is more efficient in time and space, and allows a number of optimizations based on persistent tries to reduce the size of the string table. A random triple can be retrieved, on a cold cache, in no more than four disk seeks. Organic workloads perform very well because the system attempts to place data together on disk so that, for instance, the string table for OWL Lite can be stored in a few consecutive data pages that are kept in cache by frequent accesses.
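A minimal in-memory stand-in for that string table (the linear scan and function name here are illustrative; the real table is an on-disk structure with trie-based compression) shows the interning idea:

    /* Illustrative only: interning strings to long integer indices.  The
       linear scan and absent error handling here are purely for brevity. */
    #include <stdlib.h>
    #include <string.h>

    struct strtab {
        char          **strings;
        unsigned long   count;
    };

    unsigned long strtab_intern(struct strtab *t, const char *s)
    {
        for (unsigned long i = 0; i < t->count; i++)
            if (strcmp(t->strings[i], s) == 0)
                return i;                       /* already interned */
        t->strings = realloc(t->strings, (t->count + 1) * sizeof *t->strings);
        t->strings[t->count] = strdup(s);
        return t->count++;
    }

    /* A stored triple is then three small integers rather than three strings. */
    struct triple_ids { unsigned long s, p, o; };

Comparing and joining on fixed-width integers, rather than variable-length strings, is also what keeps the index entries in the page sketch above so compact.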
